Abstract

We define XPathLog as a Datalog-style extension of XPath. XPathLog provides a clear,
declarative language for querying and manipulating XML, with particular promise for
XML data integration.

The formal semantics is defined with respect to an edge-labeled, graph-based
model which covers the XML data model. We give a complete, logic-based characterization
of XML data and of XPath, the main language concept for XML. XPath-Logic extends the
XPath language with variable bindings and embeds it into first-order logic. XPathLog is
then the Horn fragment of XPath-Logic, providing a Datalog-style, rule-based language
for querying and manipulating XML data. The model-theoretic semantics of XPath-Logic
serves as the basis of XPathLog as a logic-programming language; an equivalent
answer-set semantics for evaluating XPathLog queries is also given. In contrast to other
approaches, the XPath syntax and semantics is also used for a declarative specification
of how the database should be updated: when used in rule heads, XPath filters are interpreted
as specifications of elements and properties which should be added to the database.
1 Introduction

Logic-based languages have proven useful in many areas since they allow for small,
declarative, and extendible programs. For the database area, Datalog has been
investigated for querying and rule-based data manipulation. Extending the Datalog
idea, more complex logic-based frameworks like F-Logic (Kifer and Lausen 1989;
Kifer et al. 1995), or the languages of the TSIMMIS project (Garcia-Molina et al. 1997;
Abiteboul et al. 1997) have been successfully applied for knowledge representation
and data integration. The experiences with a powerful language like F-Logic were
the motivation to have a similar "native" language for the XML world that is much
simpler than F-Logic, and that is based on the standard XPath language. As a
result, we present XPathLog as an XPath-based Datalog-style language for querying
and manipulating XML data. By extending XPath with variable bindings and
providing a constructive semantics for XPath in rule heads, a declarative XML data
manipulation language is obtained. Since both XPath and rule-based programming
by using variable bindings are well-known, intuitive concepts, the "effect" of the
language is easy to understand on an intuitive basis. Additionally, the well-known
logic programming semantics provide concise global semantics of such programs
which coincide with the intuitive ideas. Queries and rules for manipulating and
restructuring the internal XML database can be expressed much more easily than e.g. in
XQuery (XQuery 2001) (where update functionality is still in a prototypical state).

Semistructured Data and XML. XML has been designed and accepted as the framework
for semistructured data where it plays the same role as the relational model
for classical databases. The XML data model applies both to documents and to
databases: The SGML language was originally defined for documents in the publishing
area. On the other hand, the interest in research on semistructured data
in the 1990s 1 (e.g., F-Logic (Kifer and Lausen 1989; Kifer et al. 1995), GraphLog
(Consens and Mendelzon 1990), UnQL (Buneman et al. 1996; Buneman et al. 2000),
TSIMMIS (Garcia-Molina et al. 1997; Abiteboul et al. 1997) with the OEM data
model and the MSL, WSL, and Lorel languages, Strudel/StruQL (Fernandez et al. 1997;
Fernandez et al. 1998), and YAT/YATL (Cluet et al. 1999)) was motivated by the
database community, searching for a data model for data integration and a data
format for electronic data interchange. Here, also the combination of document-oriented
aspects with database aspects was an important motivation to go beyond
classical data models which then resulted in the design of XML.

The XML data model is a hierarchical model which defines an ordered tree with 
attributes that can easily be interpreted as a document. The natural relationships 
in documents are either (i) substructures, or (ii) references to other parts of the 
document (where the term reference here means simply a cross-reference in a doc- 
ument). The nested elements define a document structure whose leaves are the text 
contents. Elements (i.e., the structuring components) are annotated by attributes 
which do not belong to the document contents. Inside the tree, cross-references 
(IDREF attributes) are allowed. 

In contrast, for databases, a hierarchical structure is in general not intuitive. 
Here, several kinds of relationships have to be represented, between substructures 
and pure references. When using XML for a database-like application, these rela- 
tionships have to be represented by reference attributes. On the other hand, order 
is often not relevant in databases. 

Mainstream XML Languages. Specialized languages have been defined for XML
querying, e.g., XQL (Robie 1999) and XML-QL (Deutsch et al. 1999); then XPath (XPath 1999)
developed from the experiences with XQL and XSL Patterns (and XPointer) as an
addressing language. XQuery (XQuery 2001) extends XPath with SQL-like constructs.
XSLT (XSLT 1999) is an XPath-based language for transforming XML
data. A proposal for extending XQuery with update constructs (XUpdate) has
been published in (Tatarinov et al. 2001); a more detailed proposal is described in
(Lehti 2001).



1 We list the approaches in the temporal order of their presentation.
Other Approaches to XML. XML-GL (Ceri et al. 1999; Comai et al. 2001) continued
the idea of GraphLog for XML. Elog (Baumgartner et al. 2001a) is based on
Datalog and classical first-order logic, flattening XML into predicates. It is used as
an internal language in Lixto (Baumgartner et al. 2001b). (Bry and Schaffert 2002)
present the Xcerpt language which regards XML trees as terms, similar to UnQL.

1.1 Comparison of Design Concepts

Amongst the existing languages for handling semistructured data and XML, there
are several facets for distinguishing them in terms of the concepts they use. A more
detailed comparison with individual languages can be found in the discussion of related work.

Data model. Semistructured data can be regarded as a general graph (OEM, UnQL,
Strudel, GraphLog, and F-Logic) or as a tree (YATL and XML). Moreover, node-labeled
graphs/trees (XML) or edge-labeled graphs (as in Strudel, UnQL, GraphLog,
and F-Logic) can be distinguished; for OEM both representations can be found. It
is easy to represent a node-labeled instance in an edge-labeled model, whereas the other
way requires node replication. Also, ordered (e.g., XML) and unordered (e.g., in
OEM, F-Logic, and UnQL) tree/graph models are distinguished.
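The step from a node-labeled to an edge-labeled representation can be illustrated by a minimal sketch. It is not taken from any of the cited systems: the function name to_edge_labeled, the integer node ids, and the pseudo-parent None for the root are assumptions made for this illustration only. Each element name moves from the node to the edge that leads to it.

```python
import xml.etree.ElementTree as ET


def to_edge_labeled(root):
    """Turn a node-labeled XML tree into a list of labeled edges.

    Each edge is a triple (parent_id, label, child_id). Node ids are
    integers assigned in document order (an assumption of this sketch);
    the root hangs below the pseudo-parent None.
    """
    edges = []
    next_id = 0

    def visit(elem, parent_id):
        nonlocal next_id
        node_id = next_id
        next_id += 1
        # The element name labels the edge that points to the node.
        edges.append((parent_id, elem.tag, node_id))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return edges


if __name__ == "__main__":
    tree = ET.fromstring("<country><name/><city><name/></city></country>")
    for edge in to_edge_labeled(tree):
        print(edge)
```

The nodes themselves carry no label afterwards, which is what makes it possible for one node to be reached via differently labeled edges in several tree views.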

Access mechanism. Generally, there are two approaches for selecting items in a 
semistructured data tree or graph: 

• by matching patterns and templates (GraphLog, MSL/WSL, UnQL, YATL, 
and later for XML-QL and XML-GL). In UnQL and Xcerpt, (bi)simulation 
between semistructured data trees is used. If a simulation of a match pattern 
with variables by the underlying database is found, the appropriate variable 
bindings are returned and used for generating an answer tree. 

• navigational access, like in object-oriented database languages (OQL), as done 
in Lorel, StruQL, and F-Logic, and later in XPath and also in our XPathLog 
approach. UnQL provides both patterns and a navigational syntax. 

Functionality. There are different approaches to either generating an answer by 
instantiating a generating pattern in the rule head according to the variable bindings 
(UnQL, StruQL, and later XML-QL, XQuery and XML-GL), or manipulating the 
underlying structure by adding information to the underlying database (GraphLog, 
F-Logic and XUpdate). 

Note that this distinction did not exist when considering classical rule-based 
languages, e.g., Datalog for predicate logic. The facts derived in the rule head were 
added to the database - either extensionally, or intensionally as view definitions - 
without directly interfering with the already stored facts. Other rules of the program 
could easily use both the original data and the derived data. For XML, a semantics 
where rules generate separate structures is easy to define. In contrast, a semantics 
where the rule heads interfere with the database contents has to take into account 
that the evaluation of rules may violate the tree structure. 
Underlying Framework. Some of the languages are based on a kind of model-theoretic
semantics: UnQL and Xcerpt directly operate on tree-term structures,
employing and defining mechanisms like tree matching, term unification and
(bi)simulation unification. Elog is based on Datalog and classical first-order logic,
flattening XML data into predicates. F-Logic defines F-Structures that extend classical
first-order logic, and then applies logic programming mechanisms to such models.
For the other languages (TSIMMIS, Strudel, XML-QL, XQuery, XUpdate), the
semantics is directly defined in terms of data structures.



Rule-based vs. Logic Programming. All languages discussed above are rule-based 
and declarative, generating variable bindings by a matching/selection part in the 
"rule body" and then using these bindings in the "rule head" for generating output 
or updating the database. This rule-based nature is more or less explicit: F-Logic, 
MSL/WSL (Tsimmis), Elog, and Xcerpt use the ":-" Prolog syntax, whereas UnQL, 
Lorel, StruQL, XML-QL and XQuery/XUpdate cover their rule-like structure in 
an SQL-like clause syntax. These clausal languages allow for a straightforward 
extension with update constructs (as it has been done for Lorel and proposed with 
XUpdate for XQuery). GraphLog and XML-GL use a graphical representation. 

The first, "pure" group separates strictly between the selection part in the rule 
body and generation/update part in the rule head, whereas UnQL, StruQL, XML- 
QL and XQuery allow for nested selection-generation parts in the rule bodies. 

The global semantics of these languages is influenced by their functionality,
distinguishing between query/transformation and query/update languages: UnQL,
Xcerpt, XML-QL, and XQuery generate (output) structures in their head which
are not fed back into the input or internal database.

Only MSL/WSL, Elog, and F-Logic allow for additions to the database or view
definitions (depending on whether bottom-up or top-down semantics is considered),
and for using the derived facts in the selection/matching part of other rules. StruQL,
XML-QL, and XQuery overcome this restriction by nesting selection-generation
parts in the rule bodies. The traverse construct of UnQL (applying a subquery by
structural recursion to arbitrary depth) also comes near to local view definitions.
Note that these languages require regular path expressions to compute the transitive
closure of a binary relation (see (Fernandez et al. 1997)). We consider the ability
to compute a transitive closure as an important feature for a language for handling
semistructured data (especially, for a "logic-programming" language, since that is
what makes the distinction between Datalog and the relational algebra/calculus).

The difference between rule-based transformation languages and logic programming
languages is mirrored by the fact that the semantics of UnQL, Xcerpt, XML-QL,
and XQuery is completely given by the semantics of their rules (qualifying them
as rule-based languages). In contrast, the global semantics of F-Logic and Elog also
requires the notions of the T_P operator and of minimal or well-founded models
(qualifying them as logic programming languages). As a consequence, they require
both a model-theoretic semantics, and an answer semantics for queries.
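To make the role of the fixpoint operator concrete, the following sketch computes a transitive closure by naive bottom-up iteration, in the spirit of applying an immediate-consequence operator until nothing new is derived. It is a generic Datalog-style illustration, not the evaluation procedure of any of the languages above; the function name and the sample pairs are made up for the example.

```python
def transitive_closure(edges):
    """Least fixpoint of the two Datalog-style rules

        reach(X, Y) :- edge(X, Y).
        reach(X, Z) :- reach(X, Y), edge(Y, Z).

    computed by naive bottom-up iteration: apply both rules to the facts
    derived so far until no new fact is produced.
    """
    edges = set(edges)
    facts = set()
    while True:
        # One application of the immediate-consequence step.
        derived = edges | {
            (x, z) for (x, y) in facts for (y2, z) in edges if y == y2
        }
        if derived <= facts:  # nothing new: least fixpoint reached
            return facts
        facts |= derived


if __name__ == "__main__":
    # Hypothetical sample relation (not data from the Mondial database).
    sample = {("a", "b"), ("b", "c"), ("c", "d")}
    print(sorted(transitive_closure(sample)))
```

A single non-recursive query of the relational algebra cannot express this for chains of arbitrary length; the iteration to a fixpoint is what supplies the additional power.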
Design Principles for XPathLog. XPathLog follows a logic-based approach which
has been motivated by the experiences with F-Logic: XML instances are mapped to 
a semantical structure for interpreting XPath-Logic formulas. XPath-Logic is based 
on (i) first-order logic, and (ii) XPath expressions extended with variable bind- 
ings. The Horn fragment of XPath-Logic, called XPathLog, provides a declarative, 
Datalog-style logic-programming language for manipulation of XML documents. 
Regarding the above design principles, XPathLog is positioned as follows: 

• XPathLog is completely XPath-based (i.e., navigational access). Matching 
and generation/update part are strictly distinguished. 

An extended XPath syntax is used for querying (rule bodies) and gener- 
ating/manipulating data (rule heads). The rule body serves for generating 
variable bindings which are then used in the head for extending the current 
XML database, thereby defining an update semantics for XPath expressions. 

• XPathLog uses an edge-labeled graph model, which is advantageous when 
defining several tree views of the internal database. The data model is partly 
ordered like in XML: subelements are ordered, attributes are unordered. 

• XPathLog is a logic-programming language according to the above character- 
ization. It works on an abstract semantical model which represents an XML 
database supporting multiple overlapping XML trees. These X-Structures to- 
gether with the logic, called XPath-Logic, provide for a logical characterization 
of XML data. XPathLog combines the intuitive "local" semantics of address- 
ing XML data by XPath with the appeal of the "global" semantics of logic 
programming. As an update language, it is based on a bottom-up semantics. 

• In contrast to XML-QL, XQuery, and XSLT, the language does not use ad- 
ditional constructs whose semantics has to be defined separately: the only 
semantic prerequisite is the bottom-up evaluation strategy of Datalog or any 
other logic programming language. 

In this paper, we describe the data model, its logical foundation, the internal semantics
of queries, rules, and programs of XPathLog as a true logic programming language
for XML. Some aspects have been published in (May 2002; May and Behrends 2001)
and with the LoPiX prototype in (May 2001c). Here, we focus on the theoretical
issues of modeling XML and the semantics of a language for queries and basic,
elementary updates. A full report on XPathLog can be found in (May 2001a).

A possible application area for XPathLog is e.g. the integration of XML data
from several sources as done in the case study (May 2001b). Here, the power of the
combination of XPath expressions with additional variable bindings allows for short,
concise, declarative, and flexible rules. Both queries and rules for manipulating
and restructuring the internal XML database can be expressed much more easily than
e.g. in XQuery (where update functionality is still in a prototypical state).

Structure of the paper. Section 2 defines X-Structures as semantical structures
which represent XML documents and presents XPath-Logic. The answer semantics
of XPathLog as an XML query language is investigated next, followed by
the semantics of rule heads for generating and modifying XML data, and
the semantics and evaluation of XPathLog programs. The implementation in the
LoPiX (Logic Programming in XML) system and a case study are described in
Section 5. An analysis, a general discussion of related work, and the conclusion
follow. Additional proofs can be found in Appendix A.



2 XPath-Logic: The Model-Theoretic Framework 

XPath-Logic and its Horn fragment, XPathLog, extend XPath (XPath 1999) with
Datalog-style variable bindings. XPath-Logic provides the model-theoretic framework
for defining a global, logic-programming style semantics for XPathLog.

XPath (XPath 1999) is the common language for addressing node sets in XML
documents. It is based on navigation through the XML tree by path expressions
of the form root/axisStep/.../axisStep where root specifies a starting point of the
expression (the root of a document, or a variable that is bound to a node in an XML
instance). Every axisStep is of the form axis::nodetest[qualifier]*. The axes define
navigation directions in an XML tree: Given an element e, the child axis contains all
its children and the attribute axis contains all its attributes. Analogously, parent,
ancestor, descendant, preceding-sibling and following-sibling axes are defined. They
enumerate the respective nodes by traversing the document tree starting from e.

First, along the chosen axis, all elements which satisfy the nodetest (which specifies
the node type or the element type of the nodes to be considered) are selected;
the resulting list is called the context. Then, the given qualifier(s) (also
called filters) are applied to each of the nodes (as the context node) for finer
selection. Inside qualifiers, relative location paths are allowed that implicitly start at
the context node. Starting with this (local) result set, the next step expression is
applied (for details, see (XPath 1999) or (XQFS 2001)). The most frequently used
axes are abbreviated as path/nodetest for path/child::nodetest, path/@nodetest for
path/attribute::nodetest, and path//nodetest for path/descendant-or-self::*/child::nodetest.

Example 1 (XML, XPath, Result Sets) 

Consider the Mondial database (May 2001e) for illustrations; the DTD is
given as follows:

<!ELEMENT mondial (country+, organization+, ...)>
<!ELEMENT country (name, population, encompassed+, border*, city+, ...)>
<!ATTLIST country car_code ID #REQUIRED capital IDREFS #REQUIRED
                  memberships IDREFS #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT encompassed EMPTY>
<!ATTLIST encompassed continent CDATA #REQUIRED
                      percentage CDATA #REQUIRED>
<!ELEMENT border EMPTY>
<!ATTLIST border country IDREF #REQUIRED length CDATA #REQUIRED>
<!ELEMENT city (name, population*)> <!ATTLIST city country IDREF #REQUIRED>
<!ELEMENT population (#PCDATA)> <!ATTLIST population year CDATA #IMPLIED>
<!ELEMENT organization (name, abbrev, members+)>
<!ATTLIST organization id ID #REQUIRED headq IDREF #IMPLIED>
<!ELEMENT abbrev (#PCDATA)>
<!ELEMENT members EMPTY>
<!ATTLIST members type CDATA #REQUIRED country IDREFS #REQUIRED>

An excerpt of the instance is given below (a depiction as a graph can be found in
the figure given when X-Structures are considered).

<country car_code="B" capital="cty-Brussels" memberships="org-eu org-nato ...">
  <name>Belgium</name> <population>10170241</population>
  <encompassed continent="Europe" percentage="100"/>
  <border country="NL" length="450"/> <border country="D" length="167"/>
  <city id="cty-Brussels" country="B"> <name>Brussels</name>
    <population year="95">951580</population>
  </city>
</country>
<country car_code="D" capital="cty-Berlin" memberships="org-eu org-nato ...">
</country>
<organization id="org-eu" headq="cty-Brussels">
  <name>European Union</name> <abbrev>EU</abbrev>
  <members type="member" country="GR F E A D I B L ..."/>
  <members type="membership applicant" country="AL CZ ..."/>
</organization>
<organization id="org-nato" headq="cty-Brussels" ...>
</organization>

The XPath expression

//country[name]/city[population/text() > 100000 and @zipcode]/name/text()

returns all names of cities such that the city belongs to (i.e., is a subelement of)
a country where a name subelement exists, the city's population is higher than
100000, and its zipcode is known.
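The stepwise evaluation of such an expression can be retraced with a small sketch. It uses Python's xml.etree.ElementTree, which supports only a subset of XPath, so the numeric comparison is done in ordinary Python code. The sample document is a hypothetical variation of the excerpt: the zipcode attributes and the cities SmallTown and NoZipCity are invented for this illustration, and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: zipcode values and two of the cities are
# invented so that every filter of the expression has something to reject.
SAMPLE = """
<mondial>
  <country car_code="B">
    <name>Belgium</name>
    <city zipcode="1000"><name>Brussels</name>
      <population year="95">951580</population></city>
    <city zipcode="0000"><name>SmallTown</name>
      <population>5000</population></city>
    <city><name>NoZipCity</name>
      <population>500000</population></city>
  </country>
</mondial>
"""


def big_cities_with_zipcode(root, threshold=100000):
    """Mimic //country[name]/city[population/text() > T and @zipcode]/name/text()."""
    names = []
    # Steps and the two existence filters that ElementTree understands.
    for city in root.findall(".//country[name]/city[@zipcode]"):
        # The comparison filter holds if some population child exceeds T.
        if any(float(p.text) > threshold for p in city.findall("population")):
            names.extend(n.text for n in city.findall("name"))
    return names


if __name__ == "__main__":
    print(big_cities_with_zipcode(ET.fromstring(SAMPLE)))
```

On the sample, only Brussels passes all filters: SmallTown fails the population comparison and NoZipCity has no zipcode attribute.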

XPath is only an addressing mechanism, not a full query language. It provides the
base for most XML query languages, which extend it with their special constructs
(e.g., functional style in XSLT, and SQL/OQL style (e.g., joins) in XQuery). In the
case of XPath-Logic and XPathLog, the extension features are Prolog/Datalog style
variable bindings, joins, and rules.

Remark 1 (Relationship to W3C Documents) 

We restrict the considerations to the core concepts of XPath as an addressing and
navigation formalism for XML data, i.e., stepwise navigation along the axes and
step qualifiers/filters. For the XPath syntax and non-formal semantics, we always
refer to the W3C XPath 2.0 Working Draft (XPath 1999). Note that the syntax
and semantics of the core concepts of XPath is the same as in XQL (Robie 1999),
XPointer, early drafts of XPath, and XPath 1.0 (XPath 1999), although both the
presentation and the naming have been changed several times.

A formal semantics of XPath has been given as a denotational semantics in
(Wadler 1999) that already covers these central notions of XPath. Later, it has
been re-formulated first in the W3C XML Query Algebra (XMQ-A 2001) and then
in the W3C Query Formal Semantics (XQFS 2001) where a description in terms of
type inference rules and value inference rules is given. Note that since the early
implementations (e.g., xt (Clark 1998)), the actual semantics of XSL Patterns/XPath
as its "behavior" has not changed. For comparing our approach with the formal
semantics of XPath, we refer to (Wadler 1999) which gives a short and concise
definition of the central concepts that is best suited as a reference.

2.1 Syntax of XPath-Logic

Inspired by the derivation of F-Logic from first-order logic as a logic for dealing
with structures containing complex objects, XPath-Logic is defined for expressing
properties of XML structures. The main difference between XPath-Logic and first-order
logic is that XPath-Logic has an additional type of atomic formulas: reference
expressions which turn out to be a special kind of predicates with a built-in semantics.
The "basic" components of the language are XPath-Logic PathExpressions
which are syntactically derived from XPath's PathExpressions by extending them
with Prolog/Datalog style variable bindings.

Definition 1 (XPath-Logic: Syntax)

The set of basic formulas of an XPath-Logic language is defined as follows:

• every language contains an infinite set Var of variables.

• a specific XPath-Logic language is given by its signature $\Sigma$ of element names,
attribute names, function names, constant symbols, and predicate names.

• XPath-Logic reference expressions over the above names extend the XPath
path expressions: The syntax of AxisSteps, axis::name[stepQualifier]*, may be
extended to bind the selected nodes to variables by "-> Var":

Step ::= Axis NodeTest StepQualifiers
       | Axis NodeTest StepQualifiers "->" Var StepQualifiers
       | Axis Var StepQualifiers
       | Axis Var StepQualifiers "->" Var StepQualifiers

For an XPath-Logic reference expression, the underlying XPath expression is
obtained by removing the inserted variable binding constructs.

• An XPath-Logic predicate is a predicate over reference expressions.

• terms and atomic formulas are defined analogously to first-order logic.

• XPath-Logic compound formulas are built over predicates and reference
expressions, using ∧, ∨, ¬, ∃, and ∀.

• XPath-Logic allows formulas in step qualifiers.
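The relationship between a reference expression and its underlying XPath expression can be sketched with a simple textual rewriting. The sketch below is only an approximation under stated assumptions: it treats every "->" followed by an identifier as a variable binding, it does not handle bindings to constants, and it leaves variables at name positions (the "Axis Var" productions) untouched, since those have no counterpart in plain XPath. The function name underlying_xpath is hypothetical.

```python
import re

# Assumption of this sketch: a binding is "->" followed by an identifier.
_BINDING = re.compile(r"\s*->\s*[A-Za-z_][A-Za-z0-9_]*")


def underlying_xpath(reference_expression):
    """Remove all '-> Var' constructs from an XPath-Logic reference expression."""
    return _BINDING.sub("", reference_expression)


if __name__ == "__main__":
    expr = "//country[name/text()->N1 and @car_code->C]//city/name/text()->N2"
    print(underlying_xpath(expr))
```

A faithful implementation would work on the parsed Step structure of the grammar rather than on strings; the regular expression is used here only to keep the illustration short.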
Note that XPath-Logic does not use the explicit dereference operator => from XPath
2.0; instead implicit dereferencing of attributes in paths is supported. 2
The goal of the paper is to introduce the Horn fragment of XPath-Logic, called
XPathLog, as a Datalog-style XML query and manipulation language. The following
example gives some XPathLog queries that review the basic XPath constructs, and
illustrate the use of the additional variable binding syntax.

Example 2 (XPathLog: Introductory Queries) 

The following examples are evaluated against the Mondial database.

Pure XPath expressions: pure XPath expressions (i.e., without variables) are
interpreted as existential queries which return true if the result set is non-empty:

?- //country[name/text() = "Belgium"]//city/name/text().
true

since the country element which has a name subelement with the text contents
"Belgium" contains at least one city descendant with a name subelement with
non-empty text contents.

Output Result Set: The query "?- xpath->N" for any XPath expression xpath
binds N to all nodes belonging to the result set of xpath:

?- //country[name/text() = "Belgium"]//city/name/text()->N.
N/"Brussels"
N/"Antwerp"

For a result set consisting of elements, logical ids are returned:

?- //country[name/text() = "Belgium"]//city->C.
C/brussels
C/antwerp

Additional Variables: XPathLog allows binding all nodes which are traversed by
an expression: The following expression returns all tuples (N1, C, N2) such that
the city with name N2 belongs to the country with name N1 and car code C:

?- //country[name/text()->N1 and @car_code->C]//city/name/text()->N2.
N2/"Brussels" C/"B" N1/"Belgium"
N2/"Antwerp" C/"B" N1/"Belgium"

N2/"Berlin" C/"D" N1/"Germany"

Dereferencing IDREF Attributes: For every organization, give the name of the
city where the headquarters is located and all names and types of members:

?- //organization[name/text()->N and abbrev/text()->A and
       @headq/name/text()->SN]
     /members[@type->MT]/@country/name/text()->MN.

2 XPath-Logic has been designed before XPath 2.0 replaced XPath's id(.) function by the deref- 
erencing operator. Furthermore, we use a data model that directly incorporates references. 
One element of the result set is e.g.,

N/"..." A/"EU" SN/"Brussels" MT/"member" MN/"Belgium"

Schema Querying: The use of variables at name positions allows for schema
querying, e.g., to give all names of subelements of elements of type city:

?- //city/SubElName.
SubElName/name
SubElName/population

Navigation Variables: Search for all things that have the name "Monaco". More
explicitly, give the element type of all elements that have a name subelement with
the text contents "Monaco":

?- //Type->X[name/text()->"Monaco"].
Type/country X/country-monaco
Type/city X/city-monaco
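How a query with additional variables yields a set of variable bindings can be illustrated procedurally. The following sketch hand-codes the "Additional Variables" query of Example 2 as nested loops over a reduced, hypothetical excerpt of the instance; it is an illustration of the intended answers, not the answer semantics or the evaluation strategy of XPathLog. The function name and the dictionary representation of bindings are assumptions.

```python
import xml.etree.ElementTree as ET

# Reduced sample instance, assumed for this illustration.
SAMPLE = """<mondial>
  <country car_code="B"><name>Belgium</name>
    <city><name>Brussels</name></city>
    <city><name>Antwerp</name></city>
  </country>
  <country car_code="D"><name>Germany</name>
    <city><name>Berlin</name></city>
  </country>
</mondial>"""


def country_city_bindings(root):
    """Bindings for
    ?- //country[name/text()->N1 and @car_code->C]//city/name/text()->N2.
    """
    answers = []
    for country in root.iter("country"):          # //country
        code = country.get("car_code")            # @car_code->C
        if code is None:
            continue
        for n1 in country.findall("name"):        # name/text()->N1
            for city in country.iter("city"):     # //city (descendants)
                for n2 in city.findall("name"):   # name/text()->N2
                    answers.append({"N1": n1.text, "C": code, "N2": n2.text})
    return answers


if __name__ == "__main__":
    for binding in country_city_bindings(ET.fromstring(SAMPLE)):
        print(binding)
```

Each loop level corresponds to one navigation step or filter, and each answer records the values bound along one successful navigation.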

Closed XPath-Logic formulas can e.g. be used for expressing integrity constraints.

Example 3 (Integrity Constraints)

There are some application-specific integrity constraints on the Mondial database:

Range restrictions: The text contents of population elements and the value of
area attributes must be a non-negative number:

∀ X: ((//population/text()->X or //@area->X) ⇒ X ≥ 0).

The sum of percentages of ethnic groups in a country is at most 100%:

∀ C: (//country->C ⇒ sum{N [C]; C/ethnicgroups/@percentage->N} ≤ 100).

Bidirectional relationships: The membership of countries in organizations is
represented bidirectionally:

∀ C,O: (//country->C[@memberships->O] ⇔
        ∃ T: //organization->O/members[@type->T and @country->C]).

Other conditions: The country attribute of border subelements of country elements
must reference a country which is encompassed by the same continent:

∀ C,C2: (//country->C/border[@country->C2] ⇒
         (//country->C2 and ∃ Cont: (C/encompassed/@continent->Cont and
                                      C2/encompassed/@continent->Cont))).
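As an illustration of what checking such a closed formula amounts to, the following sketch tests the bidirectional membership constraint on a small document. It compares the set of (country, organization) pairs obtained from the memberships attributes with the set obtained from the members elements and reports the pairs on which the two directions disagree. The sample data is hypothetical and deliberately inconsistent; the function name is an assumption, and references are compared by their ID strings rather than by resolved nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: D is listed as a member of org-nato, but its
# memberships attribute does not mention org-nato.
SAMPLE = """<mondial>
  <country car_code="B" memberships="org-eu org-nato"/>
  <country car_code="D" memberships="org-eu"/>
  <organization id="org-eu"><members type="member" country="B D"/></organization>
  <organization id="org-nato"><members type="member" country="B D"/></organization>
</mondial>"""


def membership_violations(root):
    """Return the (country, organization) pairs violating the equivalence."""
    # Left-hand side: //country->C[@memberships->O]
    from_countries = {
        (c.get("car_code"), org)
        for c in root.iter("country")
        for org in c.get("memberships", "").split()
    }
    # Right-hand side: exists T: //organization->O/members[@type->T and @country->C]
    from_orgs = {
        (code, o.get("id"))
        for o in root.iter("organization")
        for m in o.findall("members")
        if m.get("type") is not None
        for code in m.get("country", "").split()
    }
    # The equivalence holds iff both sets coincide.
    return from_countries ^ from_orgs


if __name__ == "__main__":
    print(membership_violations(ET.fromstring(SAMPLE)))
```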



2.2 XML Instances as Semantical Structures 

Next, we need a basis for a model-theoretic semantics of XPath-Logic. The
information that is carried by an XML instance is abstractly defined in the XML
Information Set specification (XMLInf 1999). It can be represented in different
ways, e.g., as the human-readable ASCII-based notation, or by using the DOM
(DOM-W3C 1998) that provides an abstract datatype for implementations. There
are approaches that regard XML trees as the database items on which the languages
operate (UnQL, Xcerpt). In our approach, the atomic items are the edges of a graph
(that can be an XML tree, but that can also represent overlapping tree views on an
internal graph-like database), called XTreeGraph. In contrast to the DOM model
and the XML Query Data Model (XMQ-D 2001) which use a node-labeled tree
(i.e., the element and attribute names are associated with the nodes), the XTreeGraph
is an edge-labeled model. Using an edge-labeled model proves useful for data
manipulation and integration (see (May and Behrends 2001)). Recall that XML-QL
(Deutsch et al. 1999) also uses an edge-labeled graph which especially defines
the same handling of text contents as ours, influenced by the experiences with the
Strudel/StruQL (Fernandez et al. 1998) project for data integration.

Formally, the XTreeGraph is represented by an X-Structure (that interprets a 
signature consisting of element and attribute names, similar to a first order struc- 
ture). The advantage of that approach is that it allows for manipulating an internal 
database by adding edges to the graph. Thus, XPathLog is not only a query lan- 
guage, but also a manipulation language. Its rule heads do not necessarily construct 
new XML trees/terms, but can update the X-Structure. As a prerequisite for map- 
ping XML instances to X-Structures, some notation for handling lists is needed: 

Notation 1 (Lists) 

Throughout this work, the following usual notation is used:

• For two sets $A$ and $B$, the set of mappings from $A$ to $B$ is denoted by $B^A$.

• A list over a domain $D$ is a mapping from $\mathbb{N}$ to $D$. Thus, the set of lists over
$D$ is denoted by $D^{\mathbb{N}}$.

• the empty list is denoted by $\varepsilon$; a unary list containing only the element $x$ is
denoted by $(x)$; list concatenation as an operator is denoted by $\circ$.

• $\mathrm{set}(expr_1(x_1,\ldots,x_n) \mid expr_2(x_1,\ldots,x_n))$ stands for

$\{expr_1(x_1,\ldots,x_n) \mid expr_2(x_1,\ldots,x_n)\}$

(i.e., the set of all $expr_1(x_1,\ldots,x_n)$ such that the condition $expr_2(x_1,\ldots,x_n)$
holds). In the following, sets are sometimes used as lists exploiting the fact
that a set can be seen as a list by an arbitrary enumeration.

• In a similar way, a list can be constructed by enumerating its elements. For
a list $\ell = (i_1, i_2, \ldots)$, $\mathrm{list}_{i \in \ell}(expr_1(i) \mid expr_2(i))$ is the list of all $expr_1(i_j)$
where $expr_2(i_j)$ holds. Similar to list, $\mathrm{concat}_{i \in \ell}(expr_1(i) \mid expr_2(i))$ does
the same if $expr_1(i)$ is already a list.

• For a finite list $\ell = (x_1,\ldots,x_n)$, $\mathrm{reverse}(\ell) = (x_n,\ldots,x_1)$.

• For a list $\ell$, the sublist that consists of the $i$th to $j$th elements is also used.

• For a list $\ell$ of pairs, i.e., $\ell = ((x_1,y_1),(x_2,y_2),\ldots)$, $\ell{\downarrow}_1$ denotes the projection
of the list to the first component of the list elements, i.e., $\ell{\downarrow}_1 := (x_1, x_2, \ldots)$.
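For readers who prefer an operational reading, the list notation can be mirrored by a few Python helpers. This is only a sketch: the helper names set_of, list_of, concat_of, reverse, and project_first are hypothetical, the expressions are passed as ordinary functions of a single argument, and finite Python lists stand in for the lists of the notation.

```python
from itertools import chain


def set_of(items, expr1, expr2):
    """set(expr1(x) | expr2(x)): all expr1(x) such that expr2(x) holds."""
    return {expr1(x) for x in items if expr2(x)}


def list_of(lst, expr1, expr2):
    """list_{i in lst}(expr1(i) | expr2(i)): like set_of, but order-preserving."""
    return [expr1(i) for i in lst if expr2(i)]


def concat_of(lst, expr1, expr2):
    """concat_{i in lst}(expr1(i) | expr2(i)): expr1(i) is itself a list;
    the selected lists are concatenated in order."""
    return list(chain.from_iterable(expr1(i) for i in lst if expr2(i)))


def reverse(lst):
    """reverse((x_1, ..., x_n)) = (x_n, ..., x_1)."""
    return lst[::-1]


def project_first(pairs):
    """Projection of a list of pairs to the first components."""
    return [x for (x, _y) in pairs]


if __name__ == "__main__":
    print(list_of([1, 2, 3, 4], lambda i: i * i, lambda i: i % 2 == 0))
    print(concat_of([1, 2, 3], lambda i: [i, i], lambda i: i != 2))
    print(project_first([("a", 1), ("b", 2)]))
```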

X-Structures. When representing XML instances as X-Structures, (i) their
element/subelement structure, and (ii) the elements' attributes have to be represented.
The universe consists of the element nodes of the XML instance and the
literals used as attribute values and text contents. Element nodes have properties,
defined by (i) subelements (which are ordered) and (ii) attributes (which are
unordered). Multivalued attributes (NMTOKENS and IDREFS) are silently split, and
reference attributes are silently resolved. Additionally, X-Structures support named
constants and predicates as known from first-order logic.

Each XML instance is represented as a structure with a universe $\mathcal{U}$ over a signature
$\Sigma = (\Sigma_N, \Sigma_F, \Sigma_C, \Sigma_P)$ which consists of

• $\Sigma_N$: element names and attribute names,

• $\Sigma_F$: names of XML-built-in functions,

• $\Sigma_C$: constant symbols, denoting elements in the XML instance (e.g., germany,
interpreted as the element addressed by /mondial/country[name="Germany"]),

• $\Sigma_P$: predicates (with arity).

• Additionally, a basic set of literal constants is assumed.

An X-Structure contains only the basic facts about the XML tree, i.e., the child 
and attribute relationships (similar to the DOM). Note that our approach which 
associates the order with the children of elements, differs from e.g., the DOM and 
XML-QL approaches where a global order of all elements is assumed. 

Definition 2 (X-Structure) 

An X-Structure over a given signature $\Sigma$ is a tuple
$\mathcal{X} = (\mathcal{V}, \mathcal{L}, \mathcal{N}, \mathcal{I}, \mathcal{E}, \mathcal{A})$ where
the universe $\mathcal{U}$ consists of three sets $\mathcal{V}$, $\mathcal{L}$, and $\mathcal{N}$:
$\mathcal{V}$ is a set of nodes (from the graph point of view, vertices), identified by
internal names, $\mathcal{L}$ is a set of literals (integers, floats, strings), and
$\mathcal{N}$ is the set of names (as, e.g., occurring in node tests).
Names may be further distinguished into $\mathcal{N}_E$, containing the element names (and a
special element text() for handling text children), and $\mathcal{N}_A$, containing the attribute
names.

• $\mathcal{I}$ is a (partial) mapping which interprets the signature:
$\mathcal{I}_E : \Sigma_N \to \mathcal{N}_E$ and $\mathcal{I}_A : \Sigma_N \to \mathcal{N}_A$
interpret the names in $\Sigma_N$ by element and attribute names.
$\mathcal{I}_C : \Sigma_C \to \mathcal{V}$ interprets the constant symbols in $\Sigma_C$ by nodes
in $\mathcal{V}$, and
$\mathcal{I}_F : \mathcal{V} \times \Sigma_F \times (\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^* \to \mathcal{V} \cup \mathcal{L} \cup \mathcal{N}$
represents the interpretation of built-in functions (as defined in (XPQOF 2001)). Finally,
$\mathcal{I}_P : \Sigma_P \times (\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^* \to \{true, false\}$
represents the interpretation of predicates.

• $\mathcal{E}$ is a (partial) mapping
$\mathcal{E} : \mathcal{V} \times \mathbb{N} \times \mathcal{N}_E \to \mathcal{V} \cup \mathcal{L}$
(subelement relationship and text contents; from the graph point of view, an ordered set of edges).

• $\mathcal{A}$ is a (partial) mapping
$\mathcal{A} : \mathcal{V} \times \mathcal{N}_A \to 2^{\mathcal{V}} \cup 2^{\mathcal{L}}$
(attribute values). XML attribute nodes do not belong to $\mathcal{V}$, but their literal values
belong to $\mathcal{L}$. For reference attributes (IDREF), the "results" are not the ID-strings,
but the target nodes in $\mathcal{V}$ themselves.

Note that $\mathcal{E}$ and $\mathcal{A}$ are not direct interpretations of $\Sigma$, but mappings
that "interpret" $\mathcal{N}$. $\Sigma$ is mapped to $\mathcal{N}$ before being interpreted by
$\mathcal{X}$, making attribute and element names full citizens of the language (as, e.g., in F-Logic).

There is a canonical mapping from the set of XML instances to the set of
X-Structures. The canonical X-Structure to an XML instance is a single XML tree
(cf. Figure 1), covering the DOM model.

[Figure 1: graph of the X-Structure of the running example. Recoverable labels:
root mondial; text() edges leading to the literals "Belgium", 10170241,
"Europ. Union" and "EU"; nodes eu-name, eu-abbrev and eu-mem1; attribute edges
@type="member" and @country.]

Fig. 1. Example X-Structure



Example 4

Figure 1 shows the X-Structure of the running example (the Mondial instance).

The elements of $\mathcal{V}$ (representing the element nodes) do not carry information in
themselves; they are only of interest as anonymous entities (similar to object ids)
which have certain properties that are given by $\mathcal{E}$, $\mathcal{A}$, and $\mathcal{I}$.
In the following, mnemonic ids (e.g., germany) are used for elements of $\mathcal{V}$. Also,
$\Sigma_N$ is identified with $\mathcal{N}_E$ and $\mathcal{N}_A$, omitting $\mathcal{I}_E$ and
$\mathcal{I}_A$.

In full generality, an X-Structure can also contain subelement edges and reference
edges which do not conform to the XML tree model, but which are crucial for data
integration: an element may be a subelement of several other elements (as we will
show in Section 4.2 and the figure given there), even with different names of the
subelement relationship ("overlapping trees"; thus the term XTreeGraph
(May and Behrends 2001) for the abstract data model).

In the following, X-Structures serve for defining a semantics for XPath-Logic, 
using the same terms as for XPath. 

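Definition 3 (Basic Result Sets: Axes)

For every node x in an X-Structure $\mathcal{X}$ and every axis a as defined in XPath,

$$A_{\mathcal{X}}(a, x) \in ((\mathcal{V} \cup \mathcal{L}) \times \mathcal{N})^{\mathbb{N}}$$

is the list of pairs (value, name) generated by axis a with x as context node (do
not confuse $A_{\mathcal{X}}$ with $\mathcal{A}$, which denotes the interpretation of attributes
in $\mathcal{X}$).

$$A_{\mathcal{X}}(child, x) := \mathrm{list}_{i \in \mathbb{N}}((y, name) \mid \mathcal{E}(x, i, name) = y)$$

$$A_{\mathcal{X}}(attribute, x) := \mathrm{list}((y, name) \mid y \in \mathcal{A}(x, name)) \quad \text{by some enumeration.}$$

For the other axes, $A_{\mathcal{X}}(a, x)$ is derived according to the XPath specification:

$$A_{\mathcal{X}}(parent, x) := \mathrm{set}((p, \mathcal{I}_F(p, name, ())) \mid x \in A_{\mathcal{X}}(child, p){\downarrow_1})$$

$$A_{\mathcal{X}}(preceding\text{-}sibling, x) := \mathrm{concat}_{p \in A_{\mathcal{X}}(parent, x){\downarrow_1}}(\mathrm{reverse}(A_{\mathcal{X}}(child, p)[1, i-1]) \mid x = A_{\mathcal{X}}(child, p)[i])$$

$$A_{\mathcal{X}}(following\text{-}sibling, x) := \mathrm{concat}_{p \in A_{\mathcal{X}}(parent, x){\downarrow_1}}(A_{\mathcal{X}}(child, p)[i+1, last()] \mid x = A_{\mathcal{X}}(child, p)[i])$$

$$A_{\mathcal{X}}(ancestor, x) := \mathrm{concat}_{(p, n) \in A_{\mathcal{X}}(parent, x)}(((p, n)) \circ A_{\mathcal{X}}(ancestor, p))$$

$$A_{\mathcal{X}}(descendant, x) := \mathrm{concat}_{(c, n) \in A_{\mathcal{X}}(child, x)}(((c, n)) \circ A_{\mathcal{X}}(descendant, c))$$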

Remark 2 

Recall that in the node-labeled XML/XPath data model, the semantics of expressions
is always a list or a set of labeled nodes. In contrast, $A_{\mathcal{X}}$ does not return
a list of (labeled) nodes, but a list of pairs (node/literal, name), according to the
edge-labeled data model underlying our approach.
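The axis definitions can be tried out on a small edge-labeled structure. The sketch below is illustrative only: the class `XStructure`, its method names and the toy Mondial-like data are assumptions made for this example, and `name()` is a simple stand-in for the built-in name function used in the parent axis.

```python
class XStructure:
    """Minimal edge-labeled structure: E holds ordered children, A attributes."""

    def __init__(self, nodes, E, A):
        self.V = set(nodes)   # element nodes
        self.E = dict(E)      # (parent, index, name) -> child node or literal
        self.A = dict(A)      # (node, attribute name) -> set of values

    def name(self, p):
        # Stand-in for name(): label of an edge leading to p (None for a root).
        for (_, _, n), y in self.E.items():
            if y == p:
                return n
        return None

    def child(self, x):
        edges = sorted(((i, n, y) for (p, i, n), y in self.E.items() if p == x),
                       key=lambda e: e[0])
        return [(y, n) for _, n, y in edges]

    def attribute(self, x):
        # "By some enumeration": values are sorted only to be deterministic.
        return [(y, n) for (p, n), ys in self.A.items() if p == x
                for y in sorted(ys, key=str)]

    def parent(self, x):
        return [(p, self.name(p)) for p in sorted(self.V)
                if x in [y for y, _ in self.child(p)]]

    def _siblings(self, x, before):
        out = []
        for p, _ in self.parent(x):
            kids = self.child(p)
            for i, (y, _) in enumerate(kids):
                if y == x:
                    out += list(reversed(kids[:i])) if before else kids[i + 1:]
        return out

    def preceding_sibling(self, x):
        return self._siblings(x, before=True)

    def following_sibling(self, x):
        return self._siblings(x, before=False)

    def ancestor(self, x):
        return [pair for p, n in self.parent(x)
                for pair in [(p, n)] + self.ancestor(p)]

    def descendant(self, x):
        return [pair for c, n in self.child(x)
                for pair in [(c, n)] + (self.descendant(c) if c in self.V else [])]


# Hypothetical toy data, loosely modelled on the Mondial running example.
xs = XStructure(
    nodes=["mondial", "belgium", "eu", "b-name", "b-pop",
           "eu-name", "eu-abbrev", "eu-mem1"],
    E={("mondial", 1, "country"): "belgium",
       ("mondial", 2, "organization"): "eu",
       ("belgium", 1, "name"): "b-name",
       ("belgium", 2, "population"): "b-pop",
       ("b-name", 1, "text()"): "Belgium",
       ("b-pop", 1, "text()"): 10170241,
       ("eu", 1, "name"): "eu-name",
       ("eu", 2, "abbrev"): "eu-abbrev",
       ("eu", 3, "member"): "eu-mem1",
       ("eu-name", 1, "text()"): "Europ. Union",
       ("eu-abbrev", 1, "text()"): "EU"},
    A={("belgium", "car_code"): {"B"},
       ("eu-mem1", "type"): {"member"},
       ("eu-mem1", "country"): {"belgium"}},
)
```

Note that the reference attribute `country` of `eu-mem1` yields the node `belgium` itself rather than an ID string, which matches the treatment of IDREF attributes described for X-Structures.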

2.3 Semantics 

The semantics of XPath-Logic is defined similar to that of first-order logic by induc- 
tion over the structure of expressions and formulas. The main task here is to define 
the semantics of reference expressions, handling navigation, order, and filtering. A 
reference expression simultaneously acts as a term (it has a result (list) and can be 
compared to terms) and as a predicate (when used in a step qualifier). 

The basic result lists are provided by $A_{\mathcal{X}}(axis, v)$ for every node v of
$\mathcal{X}$; recall that $A_{\mathcal{X}}(attribute, x)$ contains literals in case of
non-reference attributes, and element nodes in case of reference attributes.

2.3.1 Semantics of Expressions

As for first-order logic, a variable assignment $\beta : Var \to \mathcal{U}$ maps variables to
elements of the universe $\mathcal{U}$ (nodes, literals, and names) of the underlying X-Structure.
For a variable assignment $\beta$, a variable x, and $d \in \mathcal{U}$, the modified variable
assignment $\beta_x^d$ is identical with $\beta$ except that it assigns d to the variable x:

$$\beta_x^d : Var \to \mathcal{U} : \begin{cases} y \mapsto \beta(y) & \text{for } y \neq x, \\ x \mapsto d & \text{otherwise.} \end{cases}$$

For $\beta$ as above, and a variable x, $\beta \setminus \{x\}$ denotes $\beta$ without the mapping for x.

Expressions are decomposed into their axis steps. Every step consists of choosing 
an axis, preselecting nodes by a node test, and filtering the result by (i) "normal" 
predicates and (ii) XPath context functions (e.g., position() and last()) which use
the order of the intermediate result list for selecting a certain element by its index.

Definition 4 (Semantics of XPath-Logic expressions)

The semantics is defined by operators $\mathcal{S}$ and $\mathcal{Q}$ which are derived from the
formal semantics given in (Wadler 1999).

• $\mathcal{S}_{\mathcal{X}}$ : Reference_Expressions $\to (\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^{\mathbb{N}}$, and
(Axes × Reference_Expressions × $\mathcal{V}$ × Var_Assignments) $\to (\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^{\mathbb{N}}$
evaluates reference expressions wrt. an axis, a context node, and a variable
assignment and returns a result list. In the second case, we use any to denote
that the actual value of the node does not matter, and we use $\mathcal{S}^{any}$ to denote
that the actual value of axis does not matter.

• $\mathcal{Q}_{\mathcal{X}}$ : (Predicate_Expressions × $\mathcal{V}$ × Var_Assignments) $\to$ Boolean
evaluates step qualifiers wrt. a context node and a variable assignment.

Reference Expressions are evaluated by $\mathcal{S}$:

1. For closed expressions, $\mathcal{S}_{\mathcal{X}}(refExpr) = \mathcal{S}^{any}_{\mathcal{X}}(refExpr, any, \emptyset)$.

2. Reference expressions are translated into path expressions wrt. a start
node:

• rooted paths: $\mathcal{S}^{any}_{\mathcal{X}}(/p, any, \beta) = \mathcal{S}^{any}_{\mathcal{X}}(p, root, \beta)$ where root is as
follows:

* the unique root node if only one XML document is currently
stored,

* the root node that has been used in the outer expression, if /p
occurs in an expression of the form path[/p].

• rooted paths in other documents:

$\mathcal{S}^{any}_{\mathcal{X}}(document('http://...')/p, any, \beta) = \mathcal{S}^{any}_{\mathcal{X}}(p, root, \beta)$
where root is the root node of the document stored at http://...

• entry points specified by a constant c: $\mathcal{S}^{any}_{\mathcal{X}}(c/p, any, \beta) = \mathcal{S}^{any}_{\mathcal{X}}(p, \mathcal{I}_C(c), \beta)$
(this is mainly of interest when multiple documents are used and constants
are associated with their roots or some nodes, see Section 4.2).

• entry points specified by variables $v \in Var$:
$\mathcal{S}^{any}_{\mathcal{X}}(v/p, any, \beta) = \mathcal{S}^{any}_{\mathcal{X}}(p, \beta(v), \beta)$

3. Axis steps: $\mathcal{S}^{any}_{\mathcal{X}}(axis :: pattern, x, \beta) = \mathcal{S}^{axis}_{\mathcal{X}}(pattern, x, \beta)$

where pattern is of the form nodetest remainder where remainder is a
sequence of step qualifiers and variable bindings. These are evaluated
left to right, always applying the rightmost "operation" (step qualifier
or variable) to the result of the left part:

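4. Nodetest:

$$\mathcal{S}^{a}_{\mathcal{X}}(name, x, \beta) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x)}(v \mid n = name)$$
$$\mathcal{S}^{a}_{\mathcal{X}}(node(), x, \beta) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x)}(v \mid v \in \mathcal{V})$$
$$\mathcal{S}^{a}_{\mathcal{X}}(text(), x, \beta) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x)}(v \mid v \in \mathcal{L})$$
$$\mathcal{S}^{a}_{\mathcal{X}}(N, x, \beta) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x)}(v \mid n = \beta(N))$$

5. Step with variable binding:

$$\mathcal{S}^{a}_{\mathcal{X}}(pattern \to V, x, \beta) = \begin{cases} (\beta(V)) & \text{if } \beta(V) \in \mathcal{S}^{a}_{\mathcal{X}}(pattern, x, \beta), \\ \varepsilon & \text{otherwise.} \end{cases}$$

6. Step qualifiers:

$$\mathcal{S}^{a}_{\mathcal{X}}(pattern[stepQ], x, \beta) = \mathrm{list}_{y \in \mathcal{S}^{a}_{\mathcal{X}}(pattern, x, \beta)}(y \mid \mathcal{Q}_{\mathcal{X}}(stepQ, y, \beta^{k,n}_{Pos,Size}))$$

where $L_1 := \mathcal{S}^{a}_{\mathcal{X}}(pattern, x, \beta)$ and $n := size(L_1)$, and for every y, let j be
the index of y in $L_1$, $k := j$ if a is a forward axis, and $k := n+1-j$ if a
is a backward axis (cf. (Wadler 1999)). Pos and Size are only used if
the step qualifier contains a context function.

7. Path: $\mathcal{S}^{a}_{\mathcal{X}}(p_1/p_2, x, \beta) = \mathrm{concat}_{y \in \mathcal{S}^{a}_{\mathcal{X}}(p_1, x, \beta)}(\mathcal{S}^{any}_{\mathcal{X}}(p_2, y, \beta))$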

Step Qualifiers are evaluated by $\mathcal{Q}$:

8. Reference expressions have an existential semantics in step qualifiers:

$$\mathcal{Q}_{\mathcal{X}}(refExpr, y, \beta) :\Leftrightarrow \mathcal{S}^{any}_{\mathcal{X}}(refExpr, y, \beta) \neq \emptyset$$

9. Predicates (including comparison predicates): The semantics of predicates
in XPath is element-oriented: p(refExpr, term) evaluates to true if
at least one pair taken from the result sets of refExpr and term satisfies
the predicate p (either defined in $\mathcal{I}_P$, or a built-in predicate of XPath
(XPQOF 2001)):

$$\mathcal{Q}_{\mathcal{X}}(pred(expr_1, \ldots, expr_n), y, \beta) :\Leftrightarrow$$

there are $x_1 \in \mathcal{S}^{any}_{\mathcal{X}}(expr_1, y, \beta), \ldots, x_n \in \mathcal{S}^{any}_{\mathcal{X}}(expr_n, y, \beta)$
such that $(x_1, \ldots, x_n) \in \mathcal{I}_P(pred)$

10. Boolean Connectives and Quantification are defined as usual.

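Evaluation of Terms.

11. Constants $c \in \Sigma_C$: $\mathcal{S}^{any}_{\mathcal{X}}(c, x, \beta) = \mathcal{I}_C(c)$. For literals, $\mathcal{S}^{any}_{\mathcal{X}}(lit, x, \beta) = lit$.

12. Variables: $\mathcal{S}^{any}_{\mathcal{X}}(var, x, \beta) = \beta(var)$.

13. Functions and arithmetics are also defined element-wise:

$$\mathcal{S}^{any}_{\mathcal{X}}(f(expr_1, \ldots, expr_n), x, \beta) = \{\mathcal{I}_F(x, f, x_1, \ldots, x_n) \mid x_1 \in \mathcal{S}^{any}_{\mathcal{X}}(expr_1, x, \beta), \ldots, x_n \in \mathcal{S}^{any}_{\mathcal{X}}(expr_n, x, \beta)\}$$

14. Context-related functions use the extension of variable bindings by
pseudo-variables Size and Pos in rule (6):

$$\mathcal{S}^{any}_{\mathcal{X}}(position(), x, \beta) = \beta(Pos) \quad \text{and} \quad \mathcal{S}^{any}_{\mathcal{X}}(last(), x, \beta) = \beta(Size).$$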
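The interplay of node tests, step qualifiers and paths (rules 4, 6, 7, 8 and 14) can be sketched in a few lines. This is a simplified illustration rather than the evaluator of the paper: only the forward child axis is modelled (so k = j), variable bindings in steps are omitted, and the names `CHILD`, `node_test`, `step`, `path`, `exists` and `is_last` as well as the toy data are assumptions of this example.

```python
# Toy child axis: node -> ordered list of (value, edge name). Hypothetical data;
# every element node of this toy instance has children, so the keys are all nodes.
CHILD = {
    "mondial": [("al", "country"), ("gr", "country"), ("eu", "organization")],
    "al": [("al-name", "name")],
    "gr": [("gr-name", "name")],
    "eu": [("eu-abbrev", "abbrev")],
    "al-name": [("Albania", "text()")],
    "gr-name": [("Greece", "text()")],
    "eu-abbrev": [("EU", "text()")],
}
NODES = set(CHILD)

def node_test(test, x):
    """Rule 4 for the child axis: select values by edge name or by kind."""
    pairs = CHILD.get(x, [])
    if test == "node()":
        return [v for v, _ in pairs if v in NODES]
    if test == "text()":
        return [v for v, _ in pairs if v not in NODES]
    return [v for v, n in pairs if n == test]

def step(test, qualifiers, x, beta):
    """Rule 6: filter left to right; Pos and Size refer to the current list."""
    result = node_test(test, x)
    for q in qualifiers:
        size = len(result)
        result = [y for j, y in enumerate(result, start=1)
                  if q(y, {**beta, "Pos": j, "Size": size})]
    return result

def path(steps, x, beta=None):
    """Rule 7: evaluate the next step for every result and concatenate."""
    beta = beta or {}
    current = [x]
    for test, qualifiers in steps:
        current = [y for c in current for y in step(test, qualifiers, c, beta)]
    return current

def exists(steps):
    """Rule 8: a path used as a qualifier holds iff its result is non-empty."""
    return lambda y, beta: bool(path(steps, y, beta))

def is_last(y, beta):
    """Rule 14: position() = last(), read from the pseudo-variables."""
    return beta["Pos"] == beta["Size"]

names = path([("country", []), ("name", []), ("text()", [])], "mondial")
last_name = path([("country", [is_last]), ("name", []), ("text()", [])], "mondial")
with_abbrev = path([("node()", [exists([("abbrev", [])])])], "mondial")
```

Here `last_name` corresponds to country[position() = last()]/name/text() evaluated at the root, and `with_abbrev` to node()[abbrev].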

The following theorem states the equivalence of our semantics with that given in
(Wadler 1999), which is in turn equivalent to the one defined by the W3C for XPath
in (XQFS 2001).

Theorem 1 (Correctness of S and Q wrt. XPath)

For XPath reference expressions without splitting NMTOKENS attributes, the semantics
coincides with the one given in (Wadler 1999) (which already covers all core
constructs of XPath as an addressing formalism): For every XPath expression expr,

$$\mathcal{S}_{\mathcal{X}}(expr) = \mathcal{S}[\![expr]\!](x)$$

(for arbitrary x) where $\mathcal{S}[\![expr]\!]$ is as defined in (Wadler 1999).






Note that $\mathcal{S}[\![expr]\!]$ defines only a result set that is implicitly ordered wrt.
document order. Our semantics coincides with the document order as long as no
dereferencing is used.

The proof uses the following Lemma which contains the structural induction (for
proofs, see Appendix A).

Lemma 2 (Correctness of S and Q wrt. XPath: Structural Induction)
XPath-Logic reference expressions correspond to XPath as follows:

1. For absolute expressions (i.e., expr = /expr', and no free variables):

$\mathcal{S}^{any}_{\mathcal{X}}(expr, any, \emptyset) = \mathcal{S}[\![expr]\!](x)$ for arbitrary x.

2. For expressions, for all $\beta$: $\mathcal{S}^{any}_{\mathcal{X}}(expr, v, \beta) = \mathcal{S}[\![expr]\!](v)$.

3. For step qualifiers, for all $\beta$: $\mathcal{Q}_{\mathcal{X}}(stepQ, v, \beta^{k,n}_{Pos,Size}) \Leftrightarrow \mathcal{Q}[\![stepQ]\!](v, k, n)$.

4. For arithmetic expressions and built-in functions, for all $\beta$:

$\mathcal{S}^{any}_{\mathcal{X}}(expr, v, \beta^{k,n}_{Pos,Size}) = \mathcal{E}[\![expr]\!](v, k, n)$.

where $\mathcal{Q}[\![expr]\!]$, $\mathcal{S}[\![expr]\!]$, and $\mathcal{E}[\![expr]\!]$ are as defined in (Wadler 1999). Since
XPath expressions are variable-free, $\beta$ is empty except for handling the pseudo-variables
Size and Pos (which are often also empty).

The above behavior deviates from XPath for special kinds of attributes: When 
navigating along reference attributes, the result is not in document order, but in 
the same order as the referencing elements were. Additionally, NMTOKENS that are 
considered as atomic in XPath, are split in XPath-Logic. 

2.3.2 Semantics of Formulas 

Definition 5 (Semantics of XPath-Logic Formulas)

Formulas are interpreted according to the usual first-order semantics

$$\models\ \subseteq\ (\text{X-Structures} \times \text{Var\_Assignments} \times \text{Formulas})$$

15. Reference Expressions: The semantics of reference expressions corresponds
to a predicate in first-order logic, defining a purely existential
semantics:

$$(\mathcal{X}, \beta) \models refExpr \ :\Leftrightarrow\ \mathcal{S}_{\mathcal{X}}(refExpr, \beta) \neq \emptyset$$

16. predicates and boolean connectives: same as in first-order logic. 

The above definitions associate a truth value semantics with XPath-Logic formulas. 
The $\models$ relation can be used for expressing integrity constraints on XML documents
(see Example 3) and even sets of documents, and for reasoning on X-Structures.
In contrast, when defining XPathLog as a data manipulation language in the next
section, a completely different formalization of the semantics is given: there, as for
Datalog queries, the answer substitutions for a formula containing free variables
have to be computed.

2.4 Annotated Literals

The XML data model distinguishes between elements and their text contents. Nev- 
ertheless, in several situations, elements containing text contents are expected to 
act as numbers or strings: 

Example 5 (Annotated Literals) 

Consider again the Mondial XML instance. The XPath queries 

//country[population > 5000000]/name/text() and 

//country[population/text() > 5000000]/name/text() 
are equivalent and return "Belgium" in their result set. In the first query, the
element <population>10170241</population> is implicitly cast to its literal
value.

What happens here is not evident in XML/DTD environments. A corresponding
XML Schema instance would show that the complexType population is derived from
a simple type for integers. The idea here is that an element with text contents
adds structure to a simple type by allowing subelements and attributes. Thus, text
elements with attributes behave as annotated literals:

• comparisons, arithmetics, and (optionally) output use the literal value, 

• navigation expressions use the element node, and 

• in variable bindings, the variable is bound to the element, but it acts as 
described above when the variable is used e.g. in a comparison. 
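The behaviour of annotated literals can be pictured as follows. This is only a sketch under simplifying assumptions: the class `TextElement`, the helpers `literal_value` and `greater_than`, and the rule "parse as an integer if possible" are illustrative stand-ins, not the typing mechanism of XML Schema or of the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextElement:
    """An element node with text contents, e.g. <population>10170241</population>."""
    node_id: str
    name: str
    text: str

def literal_value(x):
    """Comparisons and arithmetics see the literal value of a text element."""
    if isinstance(x, TextElement):
        try:
            return int(x.text)      # assumption: integers are the only numeric type
        except ValueError:
            return x.text
    return x

def greater_than(a, b):
    return literal_value(a) > literal_value(b)

population = TextElement("b-pop", "population", "10170241")
binding = {"P": population}         # the variable is bound to the element itself
is_large = greater_than(binding["P"], 5000000)
```

The binding keeps the element node (so navigation from `P` remains possible), while the comparison silently uses the number.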

3 XPathLog: The Horn Fragment of XPath-Logic 

Similar to the case of Datalog, which is the function-free Horn fragment of first-order
predicate logic, XPathLog is a logic programming language based on XPath-Logic.

The evaluation of a query ?- $L_1, \ldots, L_n$ results in a set of variable bindings (of the
free variables of the query) to elements of the universe. The semantics of XPathLog
programs, i.e., the semantics of the evaluation of a set of XPathLog rules as a logic
program, is then defined in Section 4 by combining the answer semantics with the
model-theoretic semantics defined in the preceding section.

Definition 6 (XPathLog) 

Atoms are the basic components of XPathLog rules: 

• an XPathLog atom is either an XPath-Logic reference expression which does
not contain quantifiers or disjunction in step qualifiers, or a predicate expression
over such expressions.

• an XPathLog atom is definite if it uses only the child, sibling, and attribute
axes and the atom does not contain negation, disjunction, function applications,
and context functions. These atoms are allowed in rule heads (see
Section 4). The excluded features would cause ambiguities about which update is
intended, e.g., "insert descendant" does not specify where the element
should actually be inserted.






Similar to Datalog, an XPathLog literal is an atom or a negated atom and an XPathLog
query is a list ?- $L_1, \ldots, L_n$ of literals (in general, containing free variables).

An XPathLog rule is a formula of the form $A_1, \ldots, A_k \leftarrow L_1, \ldots, L_n$ where $L_i$ are
literals and $A_i$ are definite atoms. $L_1, \ldots, L_n$ is the body of the rule, evaluated as
a conjunction. $A_1, \ldots, A_k$ is the head of the rule, which may contain free variables
that must also occur free in the body. In contrast to usual logic programming, we
allow for lists of atoms in the rule head which are interpreted as conjunctions.

3.1 Queries in XPathLog 

The semantics $\mathcal{SB}$ of XPathLog queries associates a result set and a set of answer
substitutions with every XPathLog query by extending the above definition of $\mathcal{S}$.
The semantics provides the formal base for the implementation of an algebraic
evaluation of XPathLog queries in LoPiX (described in a later section).

3.1.1 Answers Data Model 

Whereas in Datalog, the answer to a query ?- $L_1, \ldots, L_n$ is a set of variable bindings,
the semantics of XPath-Logic reference expressions is defined wrt. an X-Structure
$\mathcal{X}$ as an annotated result list, i.e., the semantics of an expression is

(i) a result list (corresponding to the result list of the underlying XPath expression,
i.e., without the additional variable bindings), and

(ii) with every element of the result list, a list of variable bindings is associated.

The result list (i) is the same as defined by $\mathcal{S}$ in Definition 4, equivalent to the
one defined for XPath expressions in (Wadler 1999) and by the W3C for XPath
(XQFS 2001). Whereas from the XPath point of view for "addressing" nodes, only
the result list is relevant, XPathLog queries are mapped to a set of variable bindings
based on the associated bindings lists.

Example 6 (Semantics) 

First, the semantics is illustrated by an example. Let $\mathcal{X}$ be the XML structure of
the running example and

expr := //organization→O[member/@country[@car_code→C and name/text()→N]]/abbrev/text()→A.

The underlying XPath expression is

//organization[member/@country[@car_code and name/text()]]/abbrev/text()

with the result list ("UN", "EU", ...). With each of the results, a list of bindings for
the variables O, C, N, and A is associated, yielding the annotated result list

$\mathcal{SB}_{\mathcal{X}}(expr)$ = list( ("UN", { (O/un, A/"UN", C/"AL", N/"Albania"),
                        (O/un, A/"UN", C/"GR", N/"Greece"),
                        ... }),
                ("EU", { (O/eu, A/"EU", C/"D", N/"Germany"),
                        (O/eu, A/"EU", C/"F", N/"France"),
                        ... }),
                ... )

Definition 7 (Semantics)

The domain of sets of variable bindings for $V_1, \ldots, V_n$ (i.e., the domain of the second
component of our semantics, i.e., the possible answer sets for a query whose free
variables are $V_1, \ldots, V_n$) is

$$Var\_Bindings_{V_1, \ldots, V_n} := \left(2^{((\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^n)}\right)^{\{V_1, \ldots, V_n\}}.$$

Thus, in the general case for a general set Var of variables where n is unknown,

$$Var\_Bindings := \bigcup_{n \in \mathbb{N}} \left(2^{((\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^n)}\right)^{(Var^n)}$$

is the set of sets of variable assignments. For an empty set of variables, {true} is
the only element in Var_Bindings; in contrast, $\emptyset$ means that there is no variable
binding which satisfies a given requirement. We use $\beta$ for denoting an individual
variable binding, and $\xi \in Var\_Bindings$ for denoting a set of variable bindings.

$$AnnotatedResults := ((\mathcal{V} \cup \mathcal{L}) \times Var\_Bindings)$$

is the set of annotated results (i.e., an annotated result is a pair $(v, \xi)$ where v is a
node or a literal and $\xi$ is a set of variable bindings (for the set of variables occurring
free in a certain formula)).

Definition 8 (Operators on Annotated Result Lists)

From an annotated result list $\theta$, the result list is obtained as $Res(\theta)$:

$$Res : AnnotatedResults^{\mathbb{N}} \to (\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^{\mathbb{N}}$$
$$((x_1, \xi_1), \ldots, (x_n, \xi_n)) \mapsto (x_1, \ldots, x_n)$$

For an annotated result list $\theta$ and a given $x \in Res(\theta)$ contained in the result list,
the set of variable bindings associated with x is obtained by $Bdgs(\theta, x)$:

$$Bdgs : AnnotatedResults^{\mathbb{N}} \times (\mathcal{V} \cup \mathcal{L} \cup \mathcal{N}) \to Var\_Bindings$$
$$(((x_1, \xi_1), \ldots, (x, \xi), \ldots, (x_n, \xi_n)), x) \mapsto \xi \qquad (\text{let } Bdgs(\theta, x) = \emptyset \text{ if } x \notin Res(\theta))$$

Note that the joins ($\bowtie$) used in this section are always purely relational joins that
are applied to the bindings component.
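The two operators are simple projections, as the following sketch shows. Representing an annotated result list as a Python list of (value, list of binding dictionaries) pairs is an assumption of this example, as are the names `theta`, `res` and `bdgs`; the data are the explicitly listed bindings of Example 6.

```python
# Annotated result list: (result value, set of variable bindings). Bindings are
# dictionaries, so each "set" is kept as a list of dictionaries.
theta = [
    ("UN", [{"O": "un", "A": "UN", "C": "AL", "N": "Albania"},
            {"O": "un", "A": "UN", "C": "GR", "N": "Greece"}]),
    ("EU", [{"O": "eu", "A": "EU", "C": "D", "N": "Germany"},
            {"O": "eu", "A": "EU", "C": "F", "N": "France"}]),
]

def res(theta):
    """Res: forget the bindings and keep the plain result list."""
    return [x for x, _ in theta]

def bdgs(theta, x):
    """Bdgs: bindings associated with x; empty if x is not a result."""
    for y, xi in theta:
        if y == x:
            return xi
    return []
```

For instance, `bdgs(theta, "EU")` returns the two bindings that mention Germany and France.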

Example 7 (Semantics (Cont'd))

Continuing Example 6, $Res(\mathcal{SB}_{\mathcal{X}}(expr))$ = ("UN", "EU", ...) is the result list of
the underlying XPath expression, and

$Bdgs(\mathcal{SB}_{\mathcal{X}}(expr), \text{"EU"})$ = { (O/eu, A/"EU", C/"D", N/"Germany"),
                                (O/eu, A/"EU", C/"F", N/"France"),
                                ... }

yields the variable bindings that are associated with the result value "EU".



3.1.2 Safety 

The semantics definition evaluates formulas and expressions wrt. a given set of variable
bindings which, e.g., results from evaluating other subexpressions of the same
query. This approach allows for a more efficient evaluation of joins (sideways information
passing strategy), and is especially needed for evaluating negated expressions
(by defining negation as a relational "minus" operator). Negated expressions which
contain free variables are intended to exclude some bindings from a given set of
potential results. Thus, for variables occurring in the scope of a negation, the input
answer set to the negation must already provide potential bindings. This leads to
a safety requirement similar to Datalog.

Definition 9 (Safe Queries)

First, safety of variables is decided for each individual occurrence. A variable
occurrence V is safe wrt. the query if at least one of the following holds:

• the occurrence is in a literal L, and it is neither inside the scope of a negation
nor in a comparison predicate other than equality (e.g., X < 3 is unsafe),

• the occurrence is in a literal $L_i$ inside a step qualifier pattern[$L_1$ and ... and $L_n$],
and V has a safe occurrence in pattern or in some $L_j$ such that j < i,

• the occurrence is in a literal $L_i$ of the query ?- $L_1$ and ... and $L_n$, and V
has a safe occurrence in some $L_j$ such that j < i.

A query ?- $L_1, \ldots, L_n$ is safe if all variable occurrences in the query are safe.
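For flat queries, the safety condition amounts to a single left-to-right pass. The sketch below makes strong simplifications that are not part of the definition: each literal is summarized by a hypothetical `Literal` record (its variables, whether it is negated, whether it is a non-equality comparison), and nesting inside step qualifiers is ignored.

```python
from dataclasses import dataclass

@dataclass
class Literal:
    text: str                  # only used for reporting
    variables: tuple           # variables occurring in the literal
    negated: bool = False
    inequality: bool = False   # comparison predicate other than equality

def unsafe_occurrences(query):
    """Return (literal, variable) pairs whose occurrence is not safe."""
    bound = set()              # variables with a safe occurrence so far
    unsafe = []
    for lit in query:
        can_bind = not lit.negated and not lit.inequality
        for v in lit.variables:
            if not can_bind and v not in bound:
                unsafe.append((lit.text, v))
        if can_bind:
            bound.update(lit.variables)
    return unsafe

country = Literal("//country->C", ("C",))
not_member = Literal("not //organization/member/@country = C", ("C",), negated=True)
small = Literal("X < 3", ("X",), inequality=True)

safe_query = [country, not_member]
unsafe_query = [not_member, country]
```

Swapping the two literals makes the occurrence of C inside the negation unsafe, because no earlier literal provides bindings for it.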

3.1.3 Semantics of Expressions 

In the following, the semantics of safe queries is defined. The basic (non-annotated)
result lists are again provided by $A_{\mathcal{X}}(axis, v)$ for every node v of $\mathcal{X}$.

Definition 10 (Answer Semantics of XPath-Logic Expressions)

The semantics is defined by operators $\mathcal{SB}$ and $\mathcal{QB}$ derived from $\mathcal{S}$ and $\mathcal{Q}$ as defined
in Definition 4; the B stands for the extension with variable bindings:

• $\mathcal{SB}_{\mathcal{X}}$ : (Reference_Expressions) $\to AnnotatedResults^{\mathbb{N}}$, and
(Axes × $\mathcal{V}$ × Reference_Expressions × Var_Bindings) $\to AnnotatedResults^{\mathbb{N}}$
evaluates reference expressions wrt. an axis, an (optional) context node and
a given set of variable bindings and returns an annotated result list.

• $\mathcal{QB}_{\mathcal{X}}$ : (Predicate_Expressions × $\mathcal{V}$ × Var_Bindings) $\to$ Var_Bindings
evaluates step qualifiers wrt. a context node to sets of variable bindings.

Expressions are evaluated by $\mathcal{SB}$:

1. If no input bindings are given, $\mathcal{SB}_{\mathcal{X}}(refExpr) = \mathcal{SB}^{any}_{\mathcal{X}}(refExpr, any, \emptyset)$.

2. Reference expressions are translated into path expressions wrt. a start
node:

• entry points: rooted path: $\mathcal{SB}^{any}_{\mathcal{X}}(/p, any, Bdgs) = \mathcal{SB}^{any}_{\mathcal{X}}(p, root, Bdgs)$
where root is the current root as in Definition 4(2) for the same case.

• entry points: constants $c \in \Sigma_C$: $\mathcal{SB}^{any}_{\mathcal{X}}(c/p, any, Bdgs) = \mathcal{SB}^{any}_{\mathcal{X}}(p, c, Bdgs)$

• rooted paths in other documents:

$\mathcal{SB}^{any}_{\mathcal{X}}(document('http://...')/p, any, Bdgs) = \mathcal{SB}^{any}_{\mathcal{X}}(p, root, Bdgs)$
where root is the root node of the document stored at http://...

• entry points: variables $V \in Var$:

$$\mathcal{SB}^{any}_{\mathcal{X}}(V/p, any, Bdgs) = \mathrm{concat}_{x \in \mathcal{V}_{active}}(\mathcal{SB}^{any}_{\mathcal{X}}(p, x, Bdgs \bowtie \{V/x\}))$$

where $\mathcal{V}_{active}$ is the set of element nodes in the current database.
Remark: Here, the input bindings are used for optimization: if every
$\beta \in Bdgs$ already provides bindings for the variable V, the sideways
information passing strategy directly effects the join $\{V/x\} \bowtie Bdgs$,
restricting the possible values for V, which in fact results in

$$\mathcal{SB}^{any}_{\mathcal{X}}(V/p, any, Bdgs) = \mathrm{concat}_{\beta \in Bdgs,\ x = \beta(V)}(\mathcal{SB}^{any}_{\mathcal{X}}(p, x, Bdgs \bowtie \{V/x\}))$$

Thus, the propagation of bindings is not only necessary for handling
negation but also provides a relevant optimization for positive
literals.

Note that in the recursive call $\mathcal{SB}^{any}_{\mathcal{X}}(p, x, Bdgs \bowtie \{V/x\})$, the propagated
bindings are already augmented with the bindings for V.

3. Axis step: $\mathcal{SB}^{any}_{\mathcal{X}}(axis :: pattern, x, Bdgs) = \mathcal{SB}^{axis}_{\mathcal{X}}(pattern, x, Bdgs)$
where pattern is of the form nodetest remainder where remainder is a
sequence of step qualifiers and variable bindings. These are evaluated
from left to right, always applying the rightmost "operation" (qualifier
or variable) to the result of the left part:

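4. Node test:

$$\mathcal{SB}^{a}_{\mathcal{X}}(name, x, Bdgs) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x),\ n = name}(v, \{true\} \bowtie Bdgs)$$
$$\mathcal{SB}^{a}_{\mathcal{X}}(node(), x, Bdgs) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x),\ v \in \mathcal{V}}(v, \{true\} \bowtie Bdgs)$$
$$\mathcal{SB}^{a}_{\mathcal{X}}(text(), x, Bdgs) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x),\ v \in \mathcal{L}}(v, \{true\} \bowtie Bdgs)$$
$$\mathcal{SB}^{a}_{\mathcal{X}}(N, x, Bdgs) = \mathrm{list}_{(v,n) \in A_{\mathcal{X}}(a,x)}(v, \{N/n\} \bowtie Bdgs)$$

5. Step with variable binding:

$$\mathcal{SB}^{a}_{\mathcal{X}}(pattern \to V, x, Bdgs) = \mathrm{list}_{(y,\xi) \in \mathcal{SB}^{a}_{\mathcal{X}}(pattern, x, Bdgs)}(y, \xi \bowtie \{V/y\})$$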

6. Step qualifiers:

$$\mathcal{SB}^{a}_{\mathcal{X}}(pattern[stepQ], x, Bdgs) = \mathrm{list}_{(y,\xi) \in \mathcal{SB}^{a}_{\mathcal{X}}(pattern, x, Bdgs),\ \mathcal{QB}_{\mathcal{X}}(stepQ, y, \xi') \neq \emptyset}\big(y,\ \mathcal{QB}_{\mathcal{X}}(stepQ, y, \xi') \setminus \{Pos, Size\}\big)$$

If the step qualifier does not contain context functions, then $\xi' := \xi$,
otherwise let $L := \mathcal{SB}^{a}_{\mathcal{X}}(pattern, x, Bdgs)$, and then for every $(y, \xi)$ in
L, $\xi'$ is obtained as follows, extending $\xi$ with bindings of the pseudo-variables
Size and Pos:

• start with $\xi' = \emptyset$,

• for every $\beta \in \xi$, the list $L' = \mathrm{list}_{(y,\xi) \in L \text{ s.t. } \beta \in \xi}(y)$ contains all nodes
which are selected for the variable assignment $\beta$,

• let $size := size(L')$, and for every y, let j be the index of y in $L'$,
$pos := j$ if a is a forward axis, and $pos := size+1-j$ if a is a backward
axis,

• add $\beta^{pos,size}_{Pos,Size}$ to $\xi'$.

7. Path: $\mathcal{SB}^{a}_{\mathcal{X}}(p_1/p_2, x, Bdgs) = \mathrm{concat}_{(y,\xi) \in \mathcal{SB}^{a}_{\mathcal{X}}(p_1, x, Bdgs)}\ \mathcal{SB}^{any}_{\mathcal{X}}(p_2, y, \xi)$

Step Qualifiers are evaluated by $\mathcal{QB}$:

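8. Reference expressions (existential semantics) in step qualifiers:

$$\mathcal{QB}_{\mathcal{X}}(refExpr, x, Bdgs) = \bigcup_{(y,\xi) \in \mathcal{SB}^{any}_{\mathcal{X}}(refExpr, x, Bdgs)} \xi$$

9. The built-in equality predicate "=" is not only a comparison if both
sides are bound, but also serves as an assignment if the left-hand side
is a variable $V \in Var$ which is not bound in Bdgs:

$$\mathcal{QB}_{\mathcal{X}}(V = expr, x, Bdgs) = \bigcup_{(y,\xi) \in \mathcal{SB}^{any}_{\mathcal{X}}(expr, x, Bdgs)} \xi \bowtie \{V/y\}$$

All other built-in comparisons require all variables to be bound:

$$\mathcal{QB}_{\mathcal{X}}(expr_1\ op\ expr_2, x, Bdgs) = \bigcup_{(x_i,\xi_i) \in \mathcal{SB}^{any}_{\mathcal{X}}(expr_i, x, Bdgs),\ x_1\ op\ x_2} \xi_1 \bowtie \xi_2$$

10. Predicates except built-in comparisons:

$$\mathcal{QB}_{\mathcal{X}}(pred(expr_1, \ldots, expr_n), x, Bdgs) = \bigcup_{(x_i,\xi_i) \in \mathcal{SB}^{any}_{\mathcal{X}}(expr_i, x, Bdgs),\ (x_1, \ldots, x_n) \in \mathcal{I}_P(pred)} \xi_1 \bowtie \ldots \bowtie \xi_n$$

11. Negated expressions which do not contain any free variable:

$$\mathcal{QB}_{\mathcal{X}}(not\ A, x, Bdgs) = \begin{cases} Bdgs & \text{if } \mathcal{QB}_{\mathcal{X}}(A, x, \emptyset) = \emptyset, \\ \emptyset & \text{otherwise, i.e., if } \mathcal{QB}_{\mathcal{X}}(A, x, \emptyset) = \{true\}. \end{cases}$$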

12. For negated expressions which contain free variables, negation is interpreted
as the "minus" operator (as known, e.g., from the relational
algebra) wrt. the given input bindings. Thus, all variables which occur
free in A must be safe, i.e., every input variable binding has to provide
a value for them.

For two variable bindings $\beta_1, \beta_2$, we write $\beta_1 \leq \beta_2$ if all variable bindings
in $\beta_1$ occur also in $\beta_2$. Intuitively, in this case, if $\beta_1$ is "abandoned", $\beta_2$
should also be abandoned.

$$\mathcal{QB}_{\mathcal{X}}(not\ expr, x, Bdgs) = Bdgs - \{\beta \in Bdgs \mid \text{there is a } \beta' \in \mathcal{QB}_{\mathcal{X}}(expr, x, Bdgs) \text{ s.t. } \beta \leq \beta'\}$$

13. Conjunction:

$$\mathcal{QB}_{\mathcal{X}}(expr_1\ and\ expr_2, x, Bdgs) = \mathcal{QB}_{\mathcal{X}}(expr_1, x, Bdgs) \bowtie \mathcal{QB}_{\mathcal{X}}(expr_2, x, \mathcal{QB}_{\mathcal{X}}(expr_1, x, Bdgs))$$

Here, in case of negated conjuncts in the step qualifier, the safety of
variables has to be considered. The above definition assumes that by a
left-to-right evaluation of conjuncts, the evaluation is safe.

Evaluation of Terms

14. Constants: for literals, $\mathcal{SB}^{any}_{\mathcal{X}}(lit, x, Bdgs) = (lit, Bdgs)$. For constants
$c \in \Sigma_C$, $\mathcal{SB}^{any}_{\mathcal{X}}(c, x, Bdgs) = (\mathcal{I}_C(c), Bdgs)$.

15. Variables: the variable occurrence must be safe, then:
$\mathcal{SB}^{any}_{\mathcal{X}}(var, x, Bdgs) = \mathrm{list}_{\beta \in Bdgs}(\beta(var), \beta)$.

16. Function terms and arithmetics:

$$\mathcal{SB}^{any}_{\mathcal{X}}(f(arg_1, \ldots, arg_n), x, Bdgs) = \mathrm{list}_{(x_1,\xi_1) \in \mathcal{SB}^{any}_{\mathcal{X}}(arg_1, x, Bdgs), \ldots, (x_n,\xi_n) \in \mathcal{SB}^{any}_{\mathcal{X}}(arg_n, x, Bdgs)}(f(x_1, \ldots, x_n),\ \xi_1 \bowtie \ldots \bowtie \xi_n)$$

where $f(x_1, \ldots, x_n)$ results from the built-in evaluation of f.

17. Context-related functions use the extension of variable bindings by
pseudo-variables Size and Pos in rule (6):

$$\mathcal{SB}^{any}_{\mathcal{X}}(position(), x, Bdgs) = \mathrm{list}_{\beta \in Bdgs}(\beta(Pos), \{\beta' \in Bdgs \mid \beta(Pos) = \beta'(Pos)\})$$
$$\mathcal{SB}^{any}_{\mathcal{X}}(last(), x, Bdgs) = \mathrm{list}_{\beta \in Bdgs}(\beta(Size), \{\beta' \in Bdgs \mid \beta(Size) = \beta'(Size)\})$$



The above semantics is an algebraic characterization of the logical semantics of
XPath-Logic expressions which has been defined in Section 2.3.1.
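The relational operations on sets of variable bindings that the definition relies on (join, negation as "minus", and left-to-right conjunction) can be sketched as follows. Bindings are modelled as dictionaries and binding sets as lists of dictionaries; the function names and the car-code data are assumptions of this example, and the evaluators passed to `conj` stand in for $\mathcal{QB}_{\mathcal{X}}(expr_i, x, \cdot)$.

```python
TRUE = [{}]   # {true}: the set containing only the empty binding

def compatible(b1, b2):
    """Two bindings agree on all variables they share."""
    return all(b2[v] == val for v, val in b1.items() if v in b2)

def join(xs, ys):
    """Relational (natural) join of two sets of bindings."""
    out = []
    for b1 in xs:
        for b2 in ys:
            if compatible(b1, b2):
                merged = {**b1, **b2}
                if merged not in out:
                    out.append(merged)
    return out

def leq(b1, b2):
    """b1 <= b2: every variable binding in b1 also occurs in b2."""
    return all(item in b2.items() for item in b1.items())

def negate(bdgs, positive):
    """Rule 12: drop every input binding that is extended by a positive answer."""
    return [b for b in bdgs if not any(leq(b, p) for p in positive)]

def conj(q1, q2, bdgs):
    """Rule 13: the second conjunct is evaluated on the answers of the first."""
    left = q1(bdgs)
    return join(left, q2(left))

candidates = [{"C": "B"}, {"C": "D"}, {"C": "F"}]
members = join(candidates, [{"C": "D"}, {"C": "F"}])   # positive answers
non_members = negate(candidates, members)

named = lambda bd: join(bd, [{"C": "D", "N": "Germany"}, {"C": "F", "N": "France"}])
starts_with_g = lambda bd: [b for b in bd if b["N"].startswith("G")]
answer = conj(named, starts_with_g, TRUE)
```

Joining with `TRUE` leaves a binding set unchanged, whereas joining with the empty set yields the empty set; this is the distinction between {true} and ∅ drawn in Definition 7.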

Theorem 3 (Correctness of SB and QB) 

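For every (in general, containing free variables) XPathLog expression expr,

$$Res(\mathcal{SB}_{\mathcal{X}}(expr)) = \bigcup_{\beta \in (\mathcal{V} \cup \mathcal{L} \cup \mathcal{N})^{free(expr)}} \mathcal{S}_{\mathcal{X}}(expr, \beta).$$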
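In more detail, for all $x \in \mathcal{V} \cup \mathcal{L} \cup \mathcal{N}$,

$$(x \in Res(\mathcal{SB}_{\mathcal{X}}(expr)) \text{ and } \beta \in Bdgs(\mathcal{SB}_{\mathcal{X}}(expr), x)) \Leftrightarrow x \in \mathcal{S}_{\mathcal{X}}(expr, \beta).$$

Again, the theorem uses a lemma which encapsulates the structural induction.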
Lemma 4 (Correctness of SB and QB: Structural Induction)

The correctness of the answers semantics of XPathLog expressions mirrors the generation
of answer sets by the evaluation: The input set Bdgs may contain bindings
for the free variables of an expression. If for some variable var, no binding is given,
the result extends Bdgs with bindings of var. If bindings are given for var, this
specifies a constraint on the answers to be returned (expressed by joins).

• For every absolute expression expr (i.e., expr = /expr') and every set Bdgs
of variable bindings,

$(x \in Res(\mathcal{SB}_{\mathcal{X}}(expr, Bdgs))$ and $\beta \in Bdgs(\mathcal{SB}_{\mathcal{X}}(expr, Bdgs), x)) \Leftrightarrow$
$(x \in \mathcal{S}_{\mathcal{X}}(expr, \beta)$ and $\beta$ completes some $\beta' \in Bdgs$ with free(expr)).

• For every expression expr, node v, and every set Bdgs of variable bindings,

$(x \in Res(\mathcal{SB}_{\mathcal{X}}(expr, v, Bdgs))$ and $\beta \in Bdgs(\mathcal{SB}_{\mathcal{X}}(expr, v, Bdgs), x)) \Leftrightarrow$
$(x \in \mathcal{S}_{\mathcal{X}}(expr, v, \beta)$ and $\beta$ completes some $\beta' \in Bdgs$ with free(expr)).

• For every step qualifier stepQ, node v, and every set Bdgs of variable bindings,

$\beta \in \mathcal{QB}_{\mathcal{X}}(stepQ, v, Bdgs) \Leftrightarrow$
$(\mathcal{Q}_{\mathcal{X}}(stepQ, v, \beta)$ and $\beta$ completes some $\beta' \in Bdgs$ with free(stepQ)).






The proof can be found in the appendix.

3.1.4 Semantics of Queries

According to Definition 6, XPathLog queries are conjunctions of XPathLog literals.
In the following, the evaluation of safe queries is defined. The definition of safety
guarantees that a left-to-right evaluation of the body is well-defined (i.e., all variable
evaluations in Definition 10(15) are safe). Definition 10(13) already applied left-to-right
propagation when evaluating step qualifiers.

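Definition 11

The evaluation $\mathcal{QB}$ is extended to atoms by

$$\mathcal{QB}_{\mathcal{X}} : (Atoms \times Var\_Bindings) \to Var\_Bindings$$
$$(A, Bdgs) \mapsto \mathcal{QB}_{\mathcal{X}}(A, root, Bdgs)$$

For safe queries ?- atom consisting of only one atom, $\mathcal{QB}$ yields the answer bindings:

$$(\mathcal{X}, \beta) \models atom \Leftrightarrow \beta \in \mathcal{QB}_{\mathcal{X}}(atom, \emptyset).$$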

Definition 12 (Evaluation of Negated Literals)

The evaluation of negated literals L is defined wrt. a set of input bindings which
must cover the free variables in L, similar to negation in step qualifiers in Def. 10(12):

$$\mathcal{QB}_{\mathcal{X}}(not\ A, Bdgs) := Bdgs - \{\beta \in Bdgs \mid \text{there is a } \beta' \in \mathcal{QB}_{\mathcal{X}}(A, Bdgs) \text{ s.t. } \beta \leq \beta'\}.$$

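Definition 13 (Evaluation of Queries)

The evaluation of a safe query ?- $L_1, \ldots, L_n$ is defined similarly to the evaluation
of conjunctive step qualifiers in Definition 10 (13):

$\mathcal{QB}_\mathcal{X} : \mathit{Conj\_Literals} \to \mathit{Var\_Bindings}$

$\mathcal{QB}_\mathcal{X}(L_1 \wedge \ldots \wedge L_i) := \mathcal{QB}_\mathcal{X}(L_1 \wedge \ldots \wedge L_{i-1}) \bowtie \mathcal{QB}_\mathcal{X}\bigl(L_i, \mathcal{QB}_\mathcal{X}(L_1 \wedge \ldots \wedge L_{i-1})|_{\mathit{free}(L_i)}\bigr)$

Given an X-Structure $\mathcal{X}$, the answer to a query ?- $L_1, \ldots, L_n$ is the set

$\mathit{answers}_\mathcal{X}(L_1, \ldots, L_n) := \mathcal{QB}_\mathcal{X}(L_1 \wedge \ldots \wedge L_n)$ of variable bindings.

Theorem 5 (Correctness: Evaluation of Queries)

For all safe XPathLog queries $Q$, $\beta \in \mathcal{QB}_\mathcal{X}(Q) \Leftrightarrow (\mathcal{X}, \beta) \models Q$.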
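The left-to-right evaluation of Definitions 12 and 13 can be illustrated by a small sketch. It is not the evaluation over X-Structures itself: the class `Atom` is a hypothetical stand-in that treats an atom as a plain relation over variables, the names `join`, `restrict`, and `evaluate_query` are chosen here for illustration, and the evaluation is assumed to start from the set containing the single empty binding. The toy facts are made up.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A binding is an immutable set of (variable, value) pairs.
Binding = FrozenSet[Tuple[str, str]]


def binding(**pairs: str) -> Binding:
    return frozenset(pairs.items())


def compatible(b1: Binding, b2: Binding) -> bool:
    """True if the two bindings agree on every shared variable."""
    known = dict(b1)
    return all(known.get(var, val) == val for var, val in b2)


def join(left: Set[Binding], right: Set[Binding]) -> Set[Binding]:
    """Natural join of two sets of bindings."""
    return {b1 | b2 for b1 in left for b2 in right if compatible(b1, b2)}


def restrict(bdgs: Set[Binding], variables: Set[str]) -> Set[Binding]:
    """Project every binding onto the given variables."""
    return {frozenset(p for p in b if p[0] in variables) for b in bdgs}


class Atom:
    """Hypothetical stand-in for an XPathLog atom: a relation over variables."""

    def __init__(self, variables: Iterable[str], facts: Iterable[Tuple[str, ...]]):
        self.variables = tuple(variables)  # assumed to be pairwise distinct
        self.facts = set(facts)
        self.free = set(self.variables)

    def evaluate(self, bdgs: Set[Binding]) -> Set[Binding]:
        """Extend every input binding by every matching fact."""
        out = set()
        for b in bdgs:
            for fact in self.facts:
                candidate = frozenset(zip(self.variables, fact))
                if compatible(b, candidate):
                    out.add(b | candidate)
        return out


class Not:
    """Negated literal: Bdgs minus the bindings that can be extended by the atom."""

    def __init__(self, atom: Atom):
        self.atom = atom
        self.free = atom.free

    def evaluate(self, bdgs: Set[Binding]) -> Set[Binding]:
        positive = self.atom.evaluate(bdgs)
        return {b for b in bdgs if not any(b <= p for p in positive)}


def evaluate_query(literals) -> Set[Binding]:
    """Left-to-right evaluation: join the result so far with the next literal,
    which is evaluated on the result restricted to its free variables."""
    result = {frozenset()}  # assumption: start with the single empty binding
    for literal in literals:
        inputs = restrict(result, literal.free)
        result = join(result, literal.evaluate(inputs))
    return result


# Made-up toy data: ?- country(C, N), not eu_member(C).
country = Atom(("C", "N"), {("D", "Germany"), ("CH", "Switzerland")})
eu_member = Atom(("C",), {("D",)})
answers = evaluate_query([country, Not(eu_member)])
```

The negated literal only filters bindings; it never introduces new ones, which is why safety requires its free variables to be bound by literals to its left.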

Note that the semantics of formulas is not based on a Herbrand structure consisting
of ground atoms (as "usual" Herbrand semantics are), but directly on the interpretations
$\mathcal{A}_\mathcal{X}$ of the axes in the X-Structure, and on an interpretation of predicate
symbols that can be represented as a finite set of tuples over $\mathcal{V} \cup \mathcal{L} \cup \mathcal{N}$.

4 XPathLog Programs 

In logic programming, rules are used for a declarative specification: if the body of a 
clause evaluates to true for some assignment of its variables, the truth of the head 
atom for the same variable assignment can be inferred. Depending on the intention, 
this semantics can be used for (top-down) checking if something is derivable from
a given set of facts, or (bottom-up) extending a given set of facts by additional,
derived knowledge. In this work, we mainly investigate the bottom-up strategy, 
regarding XPathLog as an update language for XML databases: the evaluation of 
the body wrt. a given structure yields variable bindings which are propagated to 
the rule head where facts are added to the model. 

Positive XPathLog programs (i.e., the rules contain only positive literals; also
step qualifiers may only contain positive expressions) are evaluated bottom-up by a
$T_P$-like operator over the X-Structure, providing a minimal model semantics. The
formal definition of this operator for XPathLog programs is given in Definition 16,
after the semantics of insertions and updates has been explained.

4.1 Atomization

In this section, an alternative semantics of conjunctions of definite XPathLog atoms 
is defined which provides the base for the constructive semantics of reference ex- 
pressions in rule heads. The semantics is defined by resolving reference expressions 
syntactically into their constituent atomic steps in the same way as in F-Logic (cf.
(Frohn et al. 1994)). A similar strategy for resolving expressions into atomic steps
is followed by several approaches which store XML data in relational databases
(Deutsch et al. 2000; Shanmugasundaram et al.; Florescu and Kossmann 1999) by
flattening the XML instance to one or more universal relations.

Definition 14 (Atomization of Formulas)

The function $\mathit{atomize} : \mathit{XPathLogAtoms} \to 2^{\mathit{XPathLogAtoms}}$ resolves a definite
XPathLog atom into atoms of the form node[axis::nodetest->result] and predicates
over variables and constants. It will be used in Definition 16 for specifying the
semantics of rule heads. atomize is defined by structural induction corresponding
to the induction steps when defining $\mathcal{S}_\mathcal{X}$. In the following, path stands for a path
expression (or a variable), and name for a name (or a variable).

• The entry case: atomize(/remainder) := atomize(root/remainder)

• Paths are resolved into steps and step qualifiers are isolated (since context
functions are not allowed in definite atoms, it can be assumed that there is
at most one step qualifier, optionally preceded by a variable assignment):

atomize(path/axis::nodetest->var[stepQualifier]/remainder) :=
    atomize(path[axis::nodetest->var]) ∪
    atomize(var[stepQualifier]) ∪ atomize(var/remainder),

atomize(path/axis::nodetest[stepQualifier]/remainder) :=
    atomize(path[axis::nodetest->_X]) ∪
    atomize(_X[stepQualifier]) ∪ atomize(_X/remainder)
where _X is a new don't care variable.

• Conjunctions in step qualifiers are separated:

atomize(var[pred_1 and ... and pred_n]) :=
    atomize(var[pred_1]) ∪ ... ∪ atomize(var[pred_n])

• Predicates in step qualifiers:

atomize(var[pred(expr_1, ..., expr_n)]) :=
    atomize(equality(var, expr_1, _X_1)) ∪ ... ∪
    atomize(equality(var, expr_n, _X_n)) ∪
    {pred(_X_1, ..., _X_n)}

where equality(var, expr, X) is defined as follows (if expr_i is a constant, it is
not replaced by a variable):

* equality(var, expr, X) = "expr->X" if expr is of the form //remainder,

* equality(var, expr, X) = "var/expr->X" if expr is of the form
  axis::nodetest remainder.

• Predicate atoms are handled in the same way. Note that here all arguments 
are absolute expressions (rooted, or starting at a constant, or at a variable). 

Example 8 (Atomization)

?- //organization->O[name/text()->ON and
       @headq = members/@country[name/text()->CN]/@capital].

is atomized into

?- root[descendant::organization->O], O[name->_ON], _ON[text()->ON],
   O[@headq->_S], O[members->_M], _M[@country->_C], _C[@capital->_Cap],
   _S = _Cap, _C[child::name->_CN], _CN[text()->CN].
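The path-resolution part of atomize can be sketched on an already parsed expression. The sketch below is an illustration under simplifying assumptions, not the definition itself: the `Step` record and the list-of-steps representation are hypothetical, step qualifiers are restricted to conjunctions of relative paths (no predicates or comparisons), and fresh don't-care variables are simply numbered `_X1`, `_X2`, and so on.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Step:
    """One location step axis::nodetest, optionally bound to a variable."""
    axis: str
    nodetest: str
    var: Optional[str] = None
    # Each qualifier is a relative path; the list is read as a conjunction.
    qualifiers: List[List["Step"]] = field(default_factory=list)


def atomize(host: str, steps: List[Step],
            fresh: Optional[Iterator[int]] = None) -> List[str]:
    """Resolve host/step_1/.../step_n into atoms host[axis::nodetest->var]."""
    fresh = fresh if fresh is not None else count(1)
    atoms: List[str] = []
    for step in steps:
        # A step without a variable gets a new don't-care variable.
        var = step.var or f"_X{next(fresh)}"
        atoms.append(f"{host}[{step.axis}::{step.nodetest}->{var}]")
        # The step qualifier is isolated and atomized with var as its host.
        for qualifier in step.qualifiers:
            atoms.extend(atomize(var, qualifier, fresh))
        # The remainder of the path continues from var.
        host = var
    return atoms


# //country->C[name/text()->N], written with explicit axes.
query = [
    Step("descendant", "country", "C",
         [[Step("child", "name"), Step("child", "text()", "N")]]),
]
for atom in atomize("root", query):
    print(atom)
```

The entry case of the definition corresponds to calling `atomize` with the host `root`.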

Theorem 6 (Correctness of atomize)

The above semantics is equivalent to the one presented in Definition 13 for all
definite XPathLog atoms $A$ and every X-Structure $\mathcal{X}$, i.e.

$\mathit{answers}_\mathcal{X}(A) = \mathit{answers}_\mathcal{X}(\mathit{atomize}(A))$

Again, the theorem uses a lemma which encapsulates the structural induction, using
the logical semantics for showing the correctness of atomize.

Lemma 7 (Correctness of atomize: Structural Induction)

For every X-Structure $\mathcal{X}$ and every definite XPath-Logic atom $A$,

• for every variable assignment $\beta$ of $\mathit{free}(A)$ such that $(\mathcal{X}, \beta) \models A$, there
exists a variable assignment $\beta' \supseteq \beta$ of $\mathit{free}(\mathit{atomize}(A))$ such that
$(\mathcal{X}, \beta') \models \mathit{atomize}(A)$, and

• for every variable assignment $\beta'$ of $\mathit{free}(\mathit{atomize}(A))$ such that
$(\mathcal{X}, \beta') \models \mathit{atomize}(A)$, $(\mathcal{X}, \beta'|_{\mathit{free}(A)}) \models A$.

The proof can be found in Appendix A.

4.2 Left Hand Side 

Using logical expressions for specifying an update is perhaps the most important 
difference to approaches like XSLT, XML-QL, or XQuery where the structure to be 
generated is always specified by XML patterns, or to the update proposal for XML 
described in (Tatarinov et al. 2001). In contrast, in XPathLog, existing nodes are
communicated via variables to the head, where they are modified when appearing
at host position of atoms. The semantics of the left hand side of XPathLog rules
(which is a list of definite XPathLog atoms) is now investigated based on the
atomization of expressions. When used in the head, the "/" operator and the "[...]"
construct specify which properties should be added (thus, "[...]" does not act as
a step qualifier, but as a constructor). When using the child or attribute axis for
updates, the host of the expression gives the element to be updated or extended;
when a sibling axis is used, effectively the parent of the host is extended with a
new subelement.

Note that the (pure) XPathLog language does not allow deleting or replacing
existing elements or attributes³; modifications are always monotonic in the sense
that existing "things" remain.

Generation or Extension of Attributes. A ground-instantiated atom of the form
n[@a->v] specifies that the attribute @a of the node n should be set or extended
with v. If v is not a literal value but a node, a reference to v is stored.

Example 9 (Adding Attributes)

We add the data code to Switzerland, and make it a member of the European
Union:

C[@datacode->"ch"], C[@memberships->O] :-
    //country->C[@car_code="CH"], //organization->O[abbrev/text()->"EU"].

results in

<country datacode="ch" car_code="CH" industry="machinery chemicals watches"
         memberships="org-efta org-un org-eu ..."> ... </country>

Creation of Elements. Elements can be created as free elements by atoms of the
form /name[...] (meaning "some element of type name"; this is interpreted to create
an element which is not a subelement of any other element), or as subelements.

Example 10 (Creating Elements) 

We create a new (free) country element with some properties (cf. Figures 5 and 3):

/country[@car_code->"BAV" and @capital->X and city->X and city->Y] :-
    //city->X[name/text()="Munich"], //city->Y[name/text()="Nurnberg"].

The two city elements are linked as subelements. This operation has no equivalent
in the "classical" XML model: these elements are now children of two country
elements. Thus, changing the elements affects both trees. Linking is a crucial feature
for efficient restructuring and integration of data (cf. (May and Behrends 2001)).



Insertion of Subelements. Existing elements can be assigned as subelements to other
elements: A ground-instantiated atom n[child::s->m] makes m a subelement of
type s of n. In this case, m is linked as n/s at the end of n's children list.

³ Suitable extensions, e.g., of the form delete(elem, prop, val), can be defined. Such extensions, which
would turn XPathLog into a rule-based imperative language, are not investigated in this work.

Example 11 (Inserting Subelements)

The following two rules are equivalent to the above ones:

/country[@car_code->"BAV"].

C[@capital->X and city->X and city->Y] :- //country->C[@car_code="BAV"],
    //city->X[name/text()->"Munich"], //city->Y[name/text()->"Nurnberg"].

Here, the first rule creates a free element, whereas the second rule uses the variable
binding of C to this element for inserting subelements and attributes.

In the above case, the position of the new subelement is not specified. If the atom
is of the form h[child(i)::s->v] or h[following/preceding-sibling(j)::s->v], this means
that the new element to be inserted should be made the ith subelement of h or jth
following/preceding sibling of h, respectively.



30 



W. May 



Generation of Elements by Path Expressions. Additionally, subelements can be
created by reference expressions in the rule head which create nested elements that
satisfy the given reference expression. The atomization introduces local variables
that occur only in the head of the rule. Here, we follow the semantics of PathLog
(Frohn et al. 1994) which is implemented in Florid (Ludäscher et al. 1998) for
object creation. After the atomization, the resulting atoms are processed in an order
such that the local variables are bound to the nodes/objects which are generated.

Example 12 (Inserting Text Children)

Bavaria gets a text subelement name:

C/name[text()->"Bavaria"] :- //country->C[@car_code="BAV"].

Here, the atomized version of the rule is

C[name->_N], _N[text()->"Bavaria"] :-
    root[descendant::country->C], C[@car_code="BAV"].

The body produces the variable binding C/bavaria. When the head is evaluated,
first, the fact bavaria[child::name->x_1] is inserted, adding an (empty) name subelement
x_1 to bavaria and binding the local variable _N to x_1. Then, the second atom
is evaluated, generating the text contents of x_1.

Once-for-each-Binding. In contrast to classical logic programming where it does
not matter if a fact is "inserted" into the database several times (e.g., once in every
$T_P$ round), here subelements must be created exactly once for each instantiation of
a rule. We define a revised $T_P$ operator in Definition 16.

Using Navigation Variables for Restructuring. For data restructuring and integration,
the intuitiveness and declarativeness of a language gains much from variables
ranging not only over data, but also over schema concepts (as, e.g., in SchemaSQL
(Lakshmanan et al. 1996)). Such features have already been used for HTML-based
Web data integration with F-Logic (Ludäscher et al. 1998).

Extending the XPath wildcard concept, XPathLog also allows variables
at name position. Thus, it allows for schema querying, and also for generating new
structures dependent on the data contents of the original one.

Example 13 (Restructuring, Name Variables)

Consider a data source which provides data about waters according to the DTD

<!ELEMENT terra (water+, ...)>
<!ELEMENT water (...)> <!ATTLIST water name CDATA #REQUIRED

which contains, e.g., the following elements:

<water type="river" name="Mississippi"> ... </water>
<water type="sea" name="North Sea"> ... </water>

This tree should be converted into the target DTD

<!ELEMENT geo ((river|lake|sea)*)>
<!ELEMENT river (...)> <!ATTLIST river name CDATA #REQUIRED
(analogously for lakes and seas)

The first rule,

result/T[@name->N] :- //water[@type->T and @name->N]

creates <river name="Mississippi"/> and <sea name="North Sea"/>.
Attributes and contents are then transformed by separate rules which copy properties
by using variables at element name and attribute name position:

X[@A->V] :- //water[@type->T and @name->N and @A->V], //T->X[@name->N].

X[S->V] :- //water[@type->T and @name->N and S->V], //T->X[@name->N].
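For readers more familiar with procedural XML processing, the effect of the three rules can be approximated with Python's standard `xml.etree.ElementTree` module. This is only an analogy and not how XPathLog evaluates the rules; the function name `restructure` and the `<note>` child are made up for illustration, since the source elides the element contents.

```python
import xml.etree.ElementTree as ET

# Source instance; the <note> child is invented for this sketch.
SOURCE = """
<terra>
  <water type="river" name="Mississippi"><note>made-up content</note></water>
  <water type="sea" name="North Sea"/>
</terra>
"""


def restructure(terra: ET.Element) -> ET.Element:
    """Imperative analogue of the three restructuring rules."""
    result = ET.Element("geo")
    for water in terra.iter("water"):
        # Rule 1: the value of @type is used as the new element name.
        target = ET.SubElement(result, water.get("type"),
                               {"name": water.get("name")})
        # Rule 2: variable at attribute-name position; as written, the rule
        # copies every attribute of the water element, including @type.
        for attr, value in water.attrib.items():
            target.set(attr, value)
        # Rule 3: variable at element-name position; the children are
        # appended as shared objects, loosely mirroring linking.
        for child in water:
            target.append(child)
    return result


geo = restructure(ET.fromstring(SOURCE.strip()))
print(ET.tostring(geo, encoding="unicode"))
```

Note that the declarative rules need no explicit loop or ordering of the three steps; the sketch fixes one order only because imperative code has to.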

4.3 Global Semantics of Positive XPathLog Programs

An XPathLog program is a declarative specification of how to manipulate an XML
database, starting with one or more input documents. The semantics of XPathLog
programs is defined by bottom-up evaluation based on a $T_P$ operator. Thus, the
semantics coincides with the usual understanding of a stepwise process.

For implementing the once-for-each-binding approach, the $T_P$ operator has to
be extended with bookkeeping about the instances of inserted rule heads. Additionally,
the insertion of subelements adds some nonmonotonicity: adding an atom
n[child(i)::e->v] adds a new subelement at the ith position, making the original ith
child/sibling the (i+1)st etc. In case of multiple extensions to the same element, the
positions are determined wrt. the original structure.

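Definition 15 (Extension of X-Structures)

Given an X-Structure $\mathcal{X}$ and a set $\mathcal{I}$ of ground-instantiated atoms as obtained from
atomize to be inserted, the new X-Structure $\mathcal{X}' = \mathcal{X} \prec \mathcal{I}$ is obtained as follows:

• initialize $\mathcal{A}_{\mathcal{X}'}(\mathit{child}, x) := \mathcal{A}_\mathcal{X}(\mathit{child}, x)$, $\mathcal{A}_{\mathcal{X}'}(\mathit{attribute}, x) := \mathcal{A}_\mathcal{X}(\mathit{attribute}, x)$,
$\mathit{preds}(\mathcal{X}') := \mathit{preds}(\mathcal{X}) \cup \{p \mid p \in \mathcal{I} \text{ is a predicate atom}\}$
for all node identifiers $x$.

• for all elements of $\mathcal{A}_\mathcal{X}(\mathit{child}, x)$, let $\alpha(\mathcal{A}_\mathcal{X}(\mathit{child}, x)[i]) := \mathcal{A}_{\mathcal{X}'}(\mathit{child}, x)[i]$
($\alpha$ maps the indexing from the old list to the new one).

• for all atoms x[child(i)::e->y] $\in \mathcal{I}$, insert $(y, e)$ into $\mathcal{A}_{\mathcal{X}'}(\mathit{child}, x)$ immediately
after $\alpha(\mathcal{A}_\mathcal{X}(\mathit{child}, x)[\ldots])$.

• for all atoms x[child::e->y] $\in \mathcal{I}$, append $(y, e)$ at the end of $\mathcal{A}_{\mathcal{X}'}(\mathit{child}, x)$.

• analogously for sibling axes.

• for all atoms x[@a->y] $\in \mathcal{I}$, append $(y, a)$ to $\mathcal{A}_{\mathcal{X}'}(\mathit{attribute}, x)$.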
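The handling of child lists can be sketched for a single node as follows. The function `extend_children` and its argument layout are hypothetical. The sketch assumes that an atom with child(i) places the new entry directly before the original ith child, so that the original ith child becomes the (i+1)st as described in the text above; positions beyond the end of the original list are assumed to be appended in increasing order.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (node identifier, element name)


def extend_children(children: List[Entry],
                    positional: List[Tuple[int, Entry]],
                    appended: List[Entry]) -> List[Entry]:
    """Return the new child list of one node.

    positional holds (i, entry) pairs from atoms x[child(i)::e->y],
    appended holds the entries from atoms x[child::e->y].
    All positions refer to the ORIGINAL child list (1-based).
    """
    by_position: Dict[int, List[Entry]] = defaultdict(list)
    for i, entry in positional:
        by_position[i].append(entry)

    new_children: List[Entry] = []
    for i, original in enumerate(children, start=1):
        # Assumption: new entries for position i go right before the
        # original ith child, which thereby moves one position back.
        new_children.extend(by_position.pop(i, []))
        new_children.append(original)
    # Assumption: positions past the end of the original list are appended.
    for i in sorted(by_position):
        new_children.extend(by_position[i])
    # Unpositioned insertions go to the end of the list.
    new_children.extend(appended)
    return new_children


old = [("a", "city"), ("b", "city"), ("c", "city")]
new = extend_children(
    old,
    positional=[(1, ("n1", "city")), (3, ("n2", "city"))],
    appended=[("n3", "city")],
)
```

Because all positions are resolved against the original list, the order in which the positional atoms are processed does not shift the targets of the others, and the original children keep their relative order.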

Proposition 8 (Extension of X-Structures)

The extension operation is correct: $\mathcal{X} \prec \mathcal{I} \models \mathcal{I}$, i.e., when querying the inserted
atoms, the query evaluates to true.

With the correctness of atomize, the insertion of rule heads performs correctly:

Corollary 9 (Correctness of Insertions)

For inserting the ground-instantiated head of a rule, it is correct to insert the
atomized head: For all ground XPathLog atoms $A$, $\mathcal{X} \prec \mathit{atomize}(A) \models A$.

Definition 16 ($TX_P$-Operator for XPath-Logic Programs)

The $TX$-operator works on pairs $(\mathcal{X}, \mathit{Dic})$ where $\mathcal{X}$ is an X-Structure, and $\mathit{Dic}$ is
a dictionary which associates to every rule a set of bindings which have been
instantiated in the current iteration:

$(\mathcal{X}, \mathit{Dic}) + \{(r_1, \beta_1), \ldots, (r_n, \beta_n)\} := \bigl(\mathcal{X} \prec \{\beta_i(\mathit{atomize}(\mathit{head}(r_i))) \mid 1 \leq i \leq n\},\; \mathit{Dic}.\mathit{insert}(\{(r_1, \beta_1), \ldots, (r_n, \beta_n)\})\bigr)$,

$(\mathcal{X}, \mathit{Dic})|_\mathcal{X} := \mathcal{X}$.

For an XPathLog program $P$ and an X-Structure $\mathcal{X}$,

$TX_P(\mathcal{X}, \mathit{Dic}) := (\mathcal{X}, \mathit{Dic}) + \{(r, \beta) \mid r = (h \leftarrow b) \in P \text{ and } \mathcal{X} \models \beta(b), \text{ and } (r, \beta) \notin \mathit{Dic}\}$,
$TX_P^0(\mathcal{X}) := (\mathcal{X}, \emptyset)$,
$TX_P^{i+1}(\mathcal{X}) := TX_P(TX_P^i(\mathcal{X}))$,
$TX_P^\omega(\mathcal{X}) := (\lim_{i \to \infty} TX_P^i(\mathcal{X}))|_\mathcal{X}$ if $TX_P^0(\mathcal{X}), TX_P^1(\mathcal{X}), \ldots$ converges, and $\bot$ otherwise.



Remark 3

Note that for pure Datalog programs $P$ (i.e., only predicates over first-order terms),
the evaluation wrt. $TX_P$ does not change the semantics, i.e., $TX_P^\omega(\mathcal{X}) = T_P^\omega(\mathcal{X})$.
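The bookkeeping of Definition 16 can be illustrated by a minimal sketch. It does not operate on X-Structures: the structure is an ordinary Python dictionary, a rule is a hypothetical pair of a body function (yielding hashable bindings) and a head function (mutating the structure), and the bound `max_rounds` stands in for the non-converging case.

```python
from typing import Callable, Iterable, List, Set, Tuple

Binding = Tuple[Tuple[str, str], ...]  # hashable (variable, value) pairs
Body = Callable[[dict], Iterable[Binding]]
Head = Callable[[dict, Binding], None]
Rule = Tuple[Body, Head]


def tx_fixpoint(structure: dict, rules: List[Rule],
                max_rounds: int = 100) -> dict:
    """Iterate a TX_P-like step until no new rule instance exists."""
    fired: Set[Tuple[int, Binding]] = set()  # plays the role of Dic
    for _ in range(max_rounds):
        # Evaluate all bodies against the current structure first ...
        candidates = [
            (index, b)
            for index, (body, _head) in enumerate(rules)
            for b in body(structure)
            if (index, b) not in fired
        ]
        new_instances = list(dict.fromkeys(candidates))  # drop duplicates
        if not new_instances:
            return structure  # fixpoint reached
        # ... then insert each new head instance exactly once.
        for index, b in new_instances:
            rules[index][1](structure, b)
            fired.add((index, b))
    raise RuntimeError("no fixpoint within max_rounds")


# Toy rule in the spirit of Example 12: every country gets a name child.
def countries(structure: dict) -> Iterable[Binding]:
    return [(("C", node),) for node in sorted(structure["country"])]


def add_name(structure: dict, b: Binding) -> None:
    country = dict(b)["C"]
    structure["children"].setdefault(country, []).append("name")


structure = {"country": {"bavaria"}, "children": {}}
tx_fixpoint(structure, [(countries, add_name)])
```

Without the `fired` set, the toy rule would add another name child in every round and the iteration would never stabilize; with it, the second round finds no new instance and stops.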

Proposition 10 (Properties of the $TX_P$ operator)

The $TX_P$ operator extends the well-known $T_P$ operator. For all positive XPathLog
programs $P$, the following holds:

• without considering context functions, the $TX_P$ operator is monotonic (which
guarantees that a minimal fixpoint $TX_P^\omega(\mathcal{X})$ exists),

• $TX_P$ is order-preserving: for all XPathLog reference expressions $\mathit{expr}$ which do
not use negation or context functions, $\mathcal{S}_\mathcal{X}(\mathit{expr})$ is a sublist of $\mathcal{S}_{TX_P(\mathcal{X})}(\mathit{expr})$,

• for all atoms $A$ that do not contain aggregations or function applications, if $A$
holds in $\mathcal{X}$, then it also holds after application of $TX_P$: $\mathcal{X} \models A \Rightarrow TX_P(\mathcal{X}) \models A$.

Proof. The properties follow immediately from the definition. The child and attribute
axes are extended solely by appending and inserting new "facts".



4.4 Semantics of General XPathLog Programs

For logic programs which use negation (or similar nonmonotonic features, such
as aggregation), there is no minimal model semantics. Instead, their semantics is
defined wrt. perfect models, well-founded models, or stable models. For practical use
(especially when considering bottom-up evaluation), the notion of perfect models
and stratification (Przymusinski 1988) provides a solution to the problems raised
by negation and other nonmonotonic features. Stratification expresses the intuitive
notion of a process which executes as a sequence of steps.

Note that even for Datalog, not all programs are stratifiable. For logics over complex
structures such as F-Logic, a reasonable notion of stratification can be
defined based on the names occurring at property position, as long as variables are
not allowed at the property position. With variables allowed at property position,
it has been shown for F-Logic in (Frohn 1998) that programs are in general not
stratifiable. Since (i) even without variables at property position, there are many
programs which are not syntactically stratifiable, and (ii) variables at the property
position prove to be very useful for data integration (cf. Example 13), syntax-based
stratification is not suitable for our approach. Since the intention of XPathLog programs
is in general to implement a stepwise process by bottom-up evaluation, often
there is a natural, user-defined stratification. User-defined stratification is supported
in the LoPiX system (May 2001d) (cf. Section 5.1). The semantics is computed in the
same way as for positive programs by iterating the $TX_P$ operator for each stratum.
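Why the strata matter can be seen in a small sketch over plain Datalog-style facts. It is a generic illustration and not LoPiX code: facts are triples, rules are Python functions from a fact set to derived facts, and the names `fixpoint` and `evaluate_stratified` are chosen here. The second stratum uses negation over a relation that the first stratum has to complete beforehand.

```python
from typing import Callable, Iterable, List, Set, Tuple

Fact = Tuple[str, str, str]
RuleFn = Callable[[Set[Fact]], Iterable[Fact]]


def fixpoint(facts: Set[Fact], rules: List[RuleFn]) -> Set[Fact]:
    """Naive bottom-up iteration of one stratum until nothing new is derived."""
    facts = set(facts)
    while True:
        derived = set().union(*(rule(facts) for rule in rules)) - facts
        if not derived:
            return facts
        facts |= derived


def evaluate_stratified(facts: Set[Fact],
                        strata: List[List[RuleFn]]) -> Set[Fact]:
    """Compute the fixpoint of each stratum in the user-given order."""
    for rules in strata:
        facts = fixpoint(facts, rules)
    return facts


def reach_base(f: Set[Fact]) -> Set[Fact]:
    return {("reach", a, b) for (p, a, b) in f if p == "edge"}


def reach_step(f: Set[Fact]) -> Set[Fact]:
    return {("reach", a, c)
            for (p, a, b) in f if p == "reach"
            for (q, b2, c) in f if q == "edge" and b2 == b}


def unreachable(f: Set[Fact]) -> Set[Fact]:
    # Negation: only sound once the reach relation is complete.
    nodes = {x for (p, a, b) in f if p == "edge" for x in (a, b)}
    return {("unreach", a, b) for a in nodes for b in nodes
            if ("reach", a, b) not in f}


edges = {("edge", "a", "b"), ("edge", "b", "c")}
stratified = evaluate_stratified(edges, [[reach_base, reach_step], [unreachable]])
unstratified = fixpoint(edges, [reach_base, reach_step, unreachable])
```

In the unstratified run, the negated rule fires in the first round, before any reach fact exists, and derives facts such as unreach(a, c) that later turn out to be wrong but can no longer be retracted.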

Language Extensions. In addition to the pure language as described above, XPathLog
supports several extensions. A detailed description of, e.g., aggregation (as
known, e.g., from SQL), a class hierarchy and signatures (taken from F-Logic),
and data-driven Web access can be found in (May 2001a).

5 Implementation and Application 

5.1 Implementation: The LoPiX System 

XPathLog has been implemented in the LoPiX system (May 2001d), which extends
the pure XPathLog language with a Web-aware environment and additional
functionality for data integration. LoPiX has been developed using major components
from the Florid system (FLORID 1998; Ludäscher et al. 1998), an implementation
(in C++) of F-Logic. Due to the similarities between the F-Logic data
model and the XML data model in general, and XPath-Logic's multi-overlapping-tree
model in particular, the Florid modules provided a solid base for an XPathLog
implementation. In particular, the functionality of the complete module for the evaluation
of a deductive language over a data model with complex objects could be
reused. The system architecture of LoPiX is depicted in Figure 4.

Storage. The (extensional) database is stored in the ObjectManager. Here, two
variants have been developed: the first one uses a proprietary integrated, frame-based
model (from Florid) that is equipped with indexes for optimized access,
whereas the second one is based on a standard DOM implementation.

The ObjectManager Access encapsulates the storage by implementing the abstract
XTreeGraph data model based on the contents of the ObjectManager. This abstraction
level also adds intensional properties including derived axes, transitivity of class
hierarchy, downwards closure of signatures, inheritance, object fusion, synonyms,
built-in functionality for data conversion, string handling including matching regular
expressions, arithmetics, and aggregation operators.

The WebAccess functionality is closely intertwined with the OMAccess module: 
XML sources are mapped to trees in the internal database. Additionally, a method 
for mapping a DTD to XPathLog signature atoms is provided. 






[Figure 4: block diagram of the LoPiX components. User Interface (Pretty Printer for
bindings/XML, System Commands), XPathLog Parser (programs and queries), Logic
Evaluation (bottom-up, TX_P), Algebraic Evaluation, Algebraic Insertion, WebAccess
(DTD Parser, XML Parser), and Object Manager. Inputs are XML and DTD URLs; outputs
are interactive output and XML output. Legend: Internet in- and output; inserts to
internal storage; internal information flow; querying internal storage.]



Fig. 4. Architecture of the LoPiX System 



Evaluation. The central Evaluation module (LogicEvaluation, AlgebraicEvaluation,
and AlgebraicInsert) is taken nearly unchanged from Florid and in fact provides
a generic implementation of a deductive language over a data model with complex
objects. LogicEvaluation implements a seminaive bottom-up evaluation of rules.
AlgebraicEvaluation translates rule bodies and heads into the underlying object algebra
and evaluates the generated algebraic expressions using the query interface of
OMAccess. The object algebra implements the semantics of XPathLog queries described
in Section 3.1, generating sets of tuples of variable bindings. AlgebraicInsert
instantiates the rule heads with the generated variable bindings and adds the corresponding
facts to the database, again using the OMAccess interface, implementing
the $TX_P$ semantics defined in Section 4.3. The evaluation of algebraic expressions does
not materialize any intermediate result, but is purely based on nested iterators.
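The idea of evaluating a rule body through nested iterators, without materializing intermediate results, can be sketched with Python generators. This is an illustration of the general technique only; it does not reproduce the Florid or LoPiX object algebra. The helper `scan` and the relations are hypothetical, with city data taken from the example instances later in this section.

```python
from typing import Callable, Dict, Iterator, Optional, Sequence, Tuple

Binding = Dict[str, str]
Literal = Callable[[Binding], Iterator[Binding]]


def scan(relation: Sequence[Tuple[str, ...]], *variables: str) -> Literal:
    """Build a literal that lazily extends a binding by matching rows."""
    def literal(b: Binding) -> Iterator[Binding]:
        for row in relation:
            extended = dict(b)
            # setdefault binds a free variable or returns the existing value.
            if all(extended.setdefault(var, val) == val
                   for var, val in zip(variables, row)):
                yield extended
    return literal


def evaluate_body(literals: Sequence[Literal],
                  b: Optional[Binding] = None) -> Iterator[Binding]:
    """Nested iteration: each binding of the first literal is pushed
    through the remaining literals before the next one is produced."""
    b = {} if b is None else b
    if not literals:
        yield b
        return
    for extended in literals[0](b):
        yield from evaluate_body(literals[1:], extended)


countries = [("Germany", "Berlin")]
cities = [("Berlin", "3472009"), ("Hamburg", "1705872")]
body = [scan(countries, "N", "Cap"), scan(cities, "Cap", "Pop")]
first_answer = next(evaluate_body(body))
```

Since every stage is a generator, asking for the first answer only does the work needed to produce that answer; no list of partial bindings is ever built.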

Execution and UserInterface. The execution module provides the infrastructure for
the system, consisting of a Parser (lex/yacc-based) and a SystemCommands module
that implements (partially) non-logical commands for controlling the evaluation
process. The UserInterface module allows LoPiX to be used from the command shell
by invoking system commands and stating interactive queries. The PrettyPrinter
outputs answers in the variable bindings format known from Datalog; additionally,
the result of queries that bind only a single variable can be output as a result set in
XML ASCII representation. Furthermore, result views, i.e., the projections of trees
rooted in a given node to a given signature, can be exported.

5.2 Case-Study: Mondial 

XPathLog/LoPiX has successfully been applied in the Mondial case study (May 2001e;
May 2001b). There, the practicability of the approach for data integration is illustrated
by integrating a geographical database from the XML representations of its
sources (which have been created by Florid wrappers in (May 1999)).

The CIA World Factbook: The CIA World Factbook Country Listing (cia:,
http://www.odci.gov/cia/publications/pubs.html) provides political, economic,
social, and some geographical information about the countries. A
separate part of the CIA World Factbook provides information about political
and economic organizations (orgs:). Here, the data sources overlap by the membership
relation: with every organization, the member countries are stored in orgs
by name (using the same names as in the cia part).

Global Statistics: Cities and Provinces: The Global Statistics data (gs:, http://www.stats.demon.nl)
provides information (grouped by countries) about administrative divisions (area
and population, sometimes capital) and main cities (population with year, and
province). Whereas the country names are the same as in CIA, the names of
cities, e.g. those that are capitals of countries or where the headquarters of a political
organization is located, may differ.

The case study showed that XPathLog allows for an effective and elegant programming
of the integration process. The nature of an XPathLog program as a list of
rules allows for grouping rules which together handle a certain task. The programs
are modular, which also allows for adapting them to potential changes in the source
structure and ontology.

For data integration in general, not only "simple" updates are desired, but also
specialized operations on tree fragments. The result is constructed using subtrees,
elements, and literals of the input sources by the integration operations that extend
the basic XPathLog in LoPiX. These operations heavily depend on the use of the
XTreeGraph data model (May and Behrends 2001).

Fusing Elements and Subtrees. Fusing elements that represent the same real-world 
entity from different data sources into a unified element is an important task in 
information integration. The result is still an element of both source trees, and 
collects the attributes and subelements of both original elements. 

Example 14 (Object Fusion)

Consider two data sources as shown below and in Figure 5(a). Both describe countries,
where cia contains information about name, area, population, and capital,
and gs contains information about cities.

<!ELEMENT cia (country+)>
<!ELEMENT country (border*)>
<!ATTLIST country name CDATA #REQUIRED car_code ID #REQUIRED
                  area CDATA #IMPLIED population CDATA #IMPLIED
                  capital CDATA #REQUIRED>
<!ELEMENT border (#PCDATA)> <!ATTLIST border country IDREF #REQUIRED>

<!ELEMENT gs (country+)>
<!ELEMENT country (city+)> <!ATTLIST country name CDATA #REQUIRED>
<!ELEMENT city EMPTY>
<!ATTLIST city name CDATA #REQUIRED pop CDATA #REQUIRED>

Excerpts of the instances:

<cia>
  <country car_code='D' capital='Berlin'
           name='Germany' area='356910'
           population='83536115'>
    <border country='F'>451</border>
    <border country='A'>784</border>
  </country>
</cia>

<gs>
  <country name='Germany'>
    <city name="Berlin" pop="3472009"/>
    <city name="Hamburg" pop="1705872"/>
    :
  </country>
</gs>

An obvious and typical integration step is to unify the countries in the cia tree with
the countries in the gs tree. In XPathLog, this is done by the rule

C1 = C2 :- cia/cia:country->C1[@cia:name->N], gs/gs:country->C2[@gs:name->N].

The example is continued below; Figure 5(b) depicts the final result.
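The effect of the fusion rule on the two example instances can be approximated with Python's standard `xml.etree.ElementTree` module. This is a rough analogy under simplifying assumptions, not the XTreeGraph implementation: namespaces are ignored, the function name `fuse_countries` is made up, and "being an element of both trees" is modeled by placing the same Python object in both parent elements.

```python
import xml.etree.ElementTree as ET

CIA = ('<cia><country car_code="D" capital="Berlin" name="Germany" '
       'area="356910" population="83536115">'
       '<border country="F">451</border>'
       '<border country="A">784</border>'
       '</country></cia>')
GS = ('<gs><country name="Germany">'
      '<city name="Berlin" pop="3472009"/>'
      '<city name="Hamburg" pop="1705872"/>'
      '</country></gs>')


def fuse_countries(cia: ET.Element, gs: ET.Element) -> None:
    """Fuse country elements with equal names into one shared element."""
    by_name = {c.get("name"): c for c in cia.findall("country")}
    for index, other in enumerate(list(gs)):
        fused = by_name.get(other.get("name"))
        if other.tag != "country" or fused is None:
            continue
        # The fused element collects the attributes of both elements ...
        for attr, value in other.attrib.items():
            fused.attrib.setdefault(attr, value)
        # ... and the subelements of both elements.
        fused.extend(list(other))
        # The same object is now a child in both trees.
        gs[index] = fused


cia_root = ET.fromstring(CIA)
gs_root = ET.fromstring(GS)
fuse_countries(cia_root, gs_root)
print(ET.tostring(gs_root, encoding="unicode"))
```

After the call, navigating from either root reaches the same country element, so a later update through one tree is visible through the other.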

Synonyms. Names are also subject to operations, e.g., the integrated database uses
a unified terminology that differs from the source terminologies. Instead of generating
new relationships between nodes, target terminology is introduced by synonyms
for already existing relationships.

Example 15 (Integration: Synonyms)

In particular, synonyms are an efficient means for taking a whole property from a
source tree (and namespace) to the result tree: Consider the situation obtained in
Example 14 where the following synonyms are defined:

cia:name = name.              gs:city = city.   gs:text() = text().
cia:area = area.              gs:name = name.
cia:population = population.  gs:pop = population.

Adding Links. The integrated database often contains additional links (by subele- 
ment or reference attribute relationships) between elements that originally belong 
to different sources. 

Example 16 (Integration: Additional Links)

The integration is completed by linking the country subtrees to a result tree and
adding the capital reference attributes, here, using germany[@cia:capital="Berlin"]
and berlin[name="Berlin"]. The resulting tree fragment is given in Figure 5(b).
In XPathLog, this is done by the rules

result[country->C] :- cia[cia:country->C].
C[@capital->City] :-
    result/country->C[@cia:capital->Name and city->City[@name=Name]].



Fig. 5. Element fusion: (a) before, (b) after



Projection. When the integration and restructuring process is completed, projections
are used to define result views of the internal database. A result view is an
XML tree, e.g., specified by a root node and a DTD.

The complete case study in (May 2001b) describes the process of data integration,
data cleaning, data restructuring and distinguishing result tree views. The program
is easily extensible by additional rules for adding another data source.



6 Analysis, Related Work, and Conclusion 

6.1 Comparison with other XML Languages 

XPathLog vs. Requirements. In (Fernandez et al. 1999), XQL, XML-QL, and the
languages YATL (Cluet et al. 1999) and Lorel (Abiteboul et al. 1997; Goldman et al. 1999)
are compared, and essential features of an XML query language are identified.
XPathLog relates to their requirements as follows:

• existence of some kind of pattern clause, step qualifier clause, and constructor 
clause: pattern and step qualifier clause are the same as in XPath, extended 
with variable bindings. The path patterns are superior to XML patterns (as 
e.g. used in XML-QL) since they allow for dereferencing and navigation along 
different axes. The constructor clause uses the same XPath-based syntax. 

• constructs for imposing nesting and order: nested elements in the result tree
are generated by subsequent rules which stepwise generate the result. Grouping
(via stepwise generation) and order (via child(i)::name) are supported.

• combining data from different sources is supported. 

• tag variables or regular path expressions: tag variables are supported, regular 
path expressions are not included in the basic XPathLog language (also not 
in XPath). They are definable as derived relations. 

• alternatives are expressible using a separate rule for each alternative. 

• checking for absence of information: existence or non-existence of properties 
can be tested using negation, e.g. //country[not @indep_date]. 

• external functions: aggregation, string functions and some data conversion is 
built-in; the set of functions is extensible. 

• navigation along references: implicit dereferencing is supported. 

Semistructured Data Languages. We have already mentioned the use of logic-programming
style languages in pre-XML projects on semistructured data in Section 1.

GraphLog (Consens and Mendelzon 1990) and F-Logic/FLORID (Kifer and Lausen 1989;
Kifer et al. 1995) presented logic-programming languages over graph data models
that cover the semistructured data model, but did not yet use that notion.

In GraphLog, graphical queries are defined as patterns that are matched with an 
underlying graph database. The matched vertices are bound to variables that are 
then used for generating an output instance or for adding edges to the input graph 
in the rule head. In the graphical representation, the "rule head" is represented as 
a distinguished edge in the graphical pattern (to be added to the input graph). The 
language can be seen as a graphical representation of Datalog over binary relations. 
Thus, according to our criteria stated in Section 1.1, GraphLog qualifies as a logic-programming
language. GraphLog excludes recursive rules, but allows for closure
literals that represent the closure of a binary predicate; thus the expressiveness of 
the language is the same as for stratified linear Datalog. 

F-Logic (Kifer and Lausen 1989; Kifer et al. 1995) is a deductive object-oriented
database language that can be seen as an early concept of a semistructured,
self-describing data model. F-Logic defines a data model, a logic, and a database
query and programming language (similar to the relationship between the
X-Structures, XPath-Logic and XPathLog). The experiences with F-Logic as a formal
framework and as a language for data extraction and integration from the Web
(Ludäscher et al. 1998; May 1999) provided the background for the design of
XPath-Logic and XPathLog as a crossbreed between XPath and F-Logic. Combining the
experiences with F-Logic as a successful (but "proprietary") language for data
integration with the standards of XML and XPath was a well-grounded evolution
step. In particular, the power of the graph-based F-Logic data model compared with
the restricted tree model of XML was a central requirement in the design of
XPathLog, leading to the XTreeGraph data model for virtual trees in a graph
database. Another important aspect taken from F-Logic is to have names as
first-class citizens of the language for a seamless incorporation of metadata
information. Due to these similarities, it was possible to base the
implementation of XPathLog in the LoPiX system on the F-Logic system Florid.

The OEM (Object Exchange Model) of the Tsimmis project (Garcia-Molina et al. 1997;
Abiteboul et al. 1997) was the first data model that was dedicated explicitly to
the notion of semistructured data. OEM is a graph-based model, for which
node-labeled and edge-labeled presentations have been given. With WSL and MSL
(Wrapper/Mediator Specification Language), Datalog-style programming languages
have been presented. The Lorel language (McHugh et al. 1997) is similar to OQL,
combining navigational access (extended with regular path expressions) with clauses.
Lorel supports SQL-like, procedural update constructs. Lorel has been migrated
to XML in (Goldman et al. 1999). In contrast to the XPathLog/LoPiX migration,
Lorel does not support the XML axes.

UnQL (Buneman et al. 1996; Buneman et al. 2000) operates on rooted, edge-labeled
graphs. It embeds graph schemata that are matched as patterns with the underlying
database, combined with navigational access, into SQL-like clauses. UnQL's
semantics is based on structural recursion, similar to the later XSL.

Strudel/StruQL (Fernandez et al. 1997; Fernandez et al. 1998) also uses an
edge-labeled graph model. Its syntax embeds query patterns that are matched with
the underlying database into SQL-like clauses. StruQL rules specify what new
elementary structures are created, and what links between them are created. The
Strudel project has been continued for XML with XML-QL.

The YATL language of the YAT system (Cluet et al. 1999) is a pre-XML proposal,
already using SGML and DTDs. Its trees provide a unified model for relational,
object-oriented (ODMG), and semistructured/document data (SGML). The
YATL language follows a rule-based design for complex objects in the style of MSL
or F-Logic; it supports regular path expressions and tree algebraic operations. In
(Christophides et al. 2000), the YAT system is turned into an XML system for data
integration, which still does not use any XML/XPath language constructs. After
mapping an XML instance to a YAT tree, there is no notion of attributes.
Dereferencing is not explicitly supported, and it has no notion of the XML axes
(similar to the same issue for XML-QL).

XML Languages. XML-QL and XQuery embed XML patterns and XPath expres- 
sions, respectively, into SQL-style clauses. Expressions can be nested. 

XML-QL (Deutsch et al. 1999) uses XML patterns in the head (CONSTRUCT) and
body (WHERE) clauses. In that aspect, it is the XML-pattern counterpart to the
XPath-based XPathLog. The XML-QL patterns for selecting elements do not support
the XML axes except the child axis and, indirectly via regular path expressions,
the descendant axis. XML-QL does not support updates; a potential combination of
XML patterns and updates is not obvious.

XQuery (XQuery 2001) embeds XPath expressions in SQL-style FOR - LET -
WHERE - RETURN clauses, where the RETURN clause specifies the result as an XML
pattern. A proposal for specifying updates in XQuery has been published in
(Tatarinov et al. 2001). A more detailed proposal is described in (Lehti 2001)
and implemented in (Software AG 2001).

XML-GL (Ceri et al. 1999; Comai et al. 2001) continued the idea of GraphLog
for XML. In contrast to GraphLog, the rule body and the rule head are represented
by separate graphs, called extract-match-construct-clip queries. The rule heads
generate separate XML structures. Recursion is excluded. The MIX (Mediation in
XML) system (Baru et al. 1999) uses the Xmas (XML Matching and Structuring)
language, derived from XML-QL, for data integration. A graphical user interface
similar to XML-GL is provided. XDuce (Hosoya and Pierce 2000) is a functional-style
tree transformation language which uses regular expression pattern matching of
(originally, SGML) DTDs for formulating queries against XML instances.

Xcerpt (Bry and Schaffert 2002) is a pattern-based language for querying and
transforming XML data. It follows a clean, rule-based design where the query
(matching) part in the body is separated from the generation part in the rule head.
XML instances are regarded as terms that are matched by a term pattern in the rule
body, generating variable bindings. The semantics and the implementation are given
by simulation unification that computes answer substitutions for the variables in
the match pattern against the underlying XML term (similar to UnQL). Then, the
term in the rule head is instantiated with these variable bindings. Since rule heads
have only a generating semantics, but not an update semantics, Xcerpt can only
be used for querying and transforming XML data, but not for updating/extending
an existing internal XML database. It has a rule-based semantics, but there is no
global logic programming semantics for the evaluation of programs.

Elog (Baumgartner et al. 2001a) is a logic programming language for XML which
is used as the internal language for XML data extraction in the Lixto project
(Baumgartner et al. 2001). It is based on flattening XML data into Datalog with
specialized Web Access predicates.
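To illustrate what flattening XML data into Datalog-style facts can look like, the following Python sketch turns a small XML tree into relational tuples. It is only an illustration under assumptions: the predicate names label, child, attr, and text, the integer node ids, and the function name flatten are hypothetical and are not taken from Elog.

```python
import xml.etree.ElementTree as ET

def flatten(xml_text):
    """Flatten an XML document into a list of Datalog-like facts.
    Node ids are assigned in document order, starting at 0."""
    facts = []
    counter = 0

    def visit(elem, parent_id):
        nonlocal counter
        node_id = counter
        counter += 1
        facts.append(("label", node_id, elem.tag))
        if parent_id is not None:
            facts.append(("child", parent_id, node_id))
        for name, value in sorted(elem.attrib.items()):
            facts.append(("attr", node_id, name, value))
        if elem.text and elem.text.strip():
            facts.append(("text", node_id, elem.text.strip()))
        for sub in elem:
            visit(sub, node_id)

    visit(ET.fromstring(xml_text), None)
    return facts

facts = flatten('<country car_code="D"><city>Berlin</city></country>')
for fact in facts:
    print(fact)
```

Once the data is in this form, ordinary Datalog rules over the flat relations can express navigation, at the price of losing the path-oriented syntax that XPathLog keeps.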

Table ^ gives a comparison of some of the above-mentioned languages. The 
"paradigm" column indicates the underlying semantics of the languages: the se- 
mantics of SQL-like languages is best given as an algebraic semantics that specifies 
the type and value of expression, allowing for nested expressions. For rule-based lan- 
guages, a denotational specification of the outcome of the right-hand side (query) 
and of the result of the left-hand side is required. Logic programming languages 
require both a model-theoretic semantics (to specify the outcome of rule heads, 
and for the global semantics), and an answer semantics for the querying part. 

6.2 Contributions

We have described XPath-Logic as a logic-based framework for handling XML data, 
together with an extended XML data model that is suitable for XML querying, ma- 
nipulation, and integration. XPathLog combines the intuitive "local" semantics of 
addressing XML data by XPath with the appeal of the "global" logic programming
semantics: it is completely XPath-based, i.e., both the rule bodies and the rule 
heads use an extended XPath syntax, thereby defining a constructive semantics for 
XPath expressions. Although the syntactic difference between XPath and XPath- 
Log is small, the extension adds much to the language by turning it into a data 
manipulation language. The close relationship with XPath ensures that its declar- 
ative semantics is well understood from the XML perspective. Since both XPath 
and rule-based programming by using variable bindings are well-known, intuitive 
concepts, the "effect" of the language is easy to understand on an intuitive basis, 
making programming easy. The logic programming background provides a strong 
theoretical foundation of the language concept. 

The data model and the language are implemented in the LoPiX system. Its 
practicability has been demonstrated by the Mondial case study. 
Appendix A Proofs

Proof of the Theorem and the Lemma. The proof is done by structural induction. The
enumeration is the same as in the corresponding definition. Below, $\beta$ is an
assignment of the pseudo variables Size and Pos (often even empty). We write
$\stackrel{W}{=}$ for "equals by definition in (Wadler 1999)". The individual
items of the theorem are referred to below by IH1, ..., IH4 (induction hypotheses).

1. For closed, absolute expressions (i.e., without free variables), 

$\mathcal{S}_{\mathcal{X}}(/expr) = \mathcal{S}^{any}_{\mathcal{X}}(/expr, any, \emptyset) = \mathcal{S}[[/expr]](x)$ for arbitrary $x$.

2. Reference expressions ((Wadler 1999): only absolute expressions): 

$\mathcal{S}^{any}_{\mathcal{X}}(/p, any, \beta) \stackrel{Def}{=} \mathcal{S}^{any}_{\mathcal{X}}(p, root, \beta) \stackrel{IH}{=} \mathcal{S}^{any}[[/expr]](root)$.

3. Axis step: 

$\mathcal{S}^{any}_{\mathcal{X}}(axis :: pattern, x, \beta) \stackrel{Def}{=} \mathcal{S}^{axis}_{\mathcal{X}}(pattern, x, \beta) \stackrel{IH}{=} \mathcal{S}^{axis}[[pattern]](x)$.

4. The node test is the base case which is directly mapped to the axes: 

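$\mathcal{S}^{a}_{\mathcal{X}}(name, x, \beta) \stackrel{Def}{=} \mathrm{list}_{(v,n) \in \mathcal{A}_{\mathcal{X}}(a,x)}(v \mid n = name)$

which is characterized in (Wadler 1999) ($\mathcal{A}[[a]]$ enumerates the axes,
$\mathcal{P}(a)$ gives the axes' principal nodetype) by

$\{x_1 \mid x_1 \in \mathcal{A}[[a]]x,\ nodetype(x_1) = \mathcal{P}(a),\ name(x_1) = name\}$

which is the definition of $\mathcal{S}^{a}[[name]](x)$. Note that dereferencing
IDREF(S) and splitting NMTOKENS have been excluded; thus, the result list is
still in document order. Similarly (note that node() is not defined in
(Wadler 1999); we extend the definition according to the XPath specification):

$\mathcal{S}^{a}_{\mathcal{X}}(node(), x, \beta) \stackrel{Def}{=} \mathrm{list}_{(v,n) \in \mathcal{A}_{\mathcal{X}}(a,x)}(v \mid v \in \mathcal{V})$

$= \{x_1 \mid x_1 \in \mathcal{A}[[a]]x,\ nodetype(x_1) = element\} = \mathcal{S}^{a}[[node()]](x)$,

$\mathcal{S}^{a}_{\mathcal{X}}(text(), x, \beta) \stackrel{Def}{=} \mathrm{list}_{(v,n) \in \mathcal{A}_{\mathcal{X}}(a,x)}(v \mid v \in \mathcal{V})$

$= \{x_1 \mid x_1 \in \mathcal{A}[[a]]x,\ nodetype(x_1) = Text\} = \mathcal{S}^{a}[[text()]](x)$.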

5. Step with variable binding: obvious 
6. Step qualifiers:

$\mathcal{S}^{a}_{\mathcal{X}}(pattern[stepQ], x, \beta) \stackrel{Def}{=} \mathrm{list}_{y \in \mathcal{S}^{a}_{\mathcal{X}}(pattern, x, \beta)}(y \mid \mathcal{Q}_{\mathcal{X}}(stepQ, y, \beta^{k,n}_{Pos,Size}))$

where $L_1 := \mathcal{S}^{a}_{\mathcal{X}}(pattern, x, \beta)$, which equals
$\mathcal{S}^{a}[[pattern]](x)$ by induction hypothesis IH3, and $n := size(L_1)$,
and for every $y$, let $j$ be the index of $y$ in $L_1$ (which equals
$size(\{x_1 \mid x_1 \in L_1, x_1 \le_{doc} y\})$), $k := j$ if $a$ is a forward
axis, and $k := n+1-j$ if $a$ is a backward axis. This is the same as defined for
$\mathcal{S}^{a}[[pattern[stepQualifier]]](x)$ and, by induction hypothesis IH3, the same as

$\stackrel{IH3}{=} \mathrm{list}_{y \in \mathcal{S}^{a}_{\mathcal{X}}(pattern, x, \beta)}(y \mid \mathcal{Q}[[stepQ]](y, k, n)) = \mathcal{S}^{a}[[pattern[stepQ]]](x)$.

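7. Path: $\mathcal{S}^{a}_{\mathcal{X}}(p_1/p_2, x, \beta) \stackrel{Def}{=} \mathrm{concat}_{y \in \mathcal{S}^{a}_{\mathcal{X}}(p_1, x, \beta)}(\mathcal{S}^{any}_{\mathcal{X}}(p_2, y, \beta))$

$\stackrel{IH}{=} \mathrm{concat}_{y \in \mathcal{S}^{a}[[p_1]](x)}(\mathcal{S}^{a}[[p_2]](y)) = \mathcal{S}^{a}[[p_1/p_2]](x)$.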

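8. Reference expressions (existential semantics) in step qualifiers:

$\mathcal{Q}_{\mathcal{X}}(refExpr, x, \beta) \stackrel{Def}{\Leftrightarrow} \mathcal{S}^{any}_{\mathcal{X}}(refExpr, x, \beta) \neq \emptyset$

$\stackrel{IH}{\Leftrightarrow} \mathcal{S}[[refExpr]](x) \neq \emptyset \Leftrightarrow \mathcal{Q}[[refExpr]](x, k, n)$

(for all $k$, $n$, since these are not used in refExpr).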

9. Predicates: (Wadler 1999) knows only the "=" comparison. The definition there
is, however, not complete: e.g., for step qualifiers of the form [a/b/c = "foo"],
which are allowed in XPath, no semantics is defined. We extend the
semantics according to the XPath specification, applying either $\mathcal{S}$ or $\mathcal{E}$.

$\mathcal{Q}_{\mathcal{X}}(pred(expr_1, \ldots, expr_n), x, \beta)$

$\stackrel{Def}{\Leftrightarrow}$ there are $x_1 \in \mathcal{S}^{any}_{\mathcal{X}}(expr_1, x, \beta), \ldots, x_n \in \mathcal{S}^{any}_{\mathcal{X}}(expr_n, x, \beta)$
such that $(x_1, \ldots, x_n) \in \mathcal{I}(pred)$

$\stackrel{IH}{\Leftrightarrow}$ there are $x_1 \in \mathcal{S}^{child}[[expr_1]](x)$ or $x_1 \in \mathcal{E}[[expr_1]](x, \beta(Pos), \beta(Size))$, ...,
$x_n \in \mathcal{S}^{child}[[expr_n]](x)$ or $x_n \in \mathcal{E}[[expr_n]](x, \beta(Pos), \beta(Size))$

such that $(x_1, \ldots, x_n) \in \mathcal{I}(pred)$.

10.-13. Boolean connectives and quantification, constants, and variables: obvious.
Functions are not defined in (Wadler 1999), but the extension is obvious.

14. Context-related functions use the extension of variable bindings by
pseudo-variables Size and Pos in rule (6):

$\mathcal{S}^{any}_{\mathcal{X}}(position(), x, \beta) \stackrel{Def}{=} \beta(Pos) = \mathcal{E}[[position()]](x, \beta(Pos), \beta(Size))$
$\mathcal{S}^{any}_{\mathcal{X}}(last(), x, \beta) \stackrel{Def}{=} \beta(Size) = \mathcal{E}[[last()]](x, \beta(Pos), \beta(Size))$.
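The treatment of step qualifiers in item 6 of the proof above assigns each node of the intermediate result list a context position k and a context size n, counting from the front for forward axes and from the back for backward axes. The following Python sketch restates that bookkeeping under simplifying assumptions: nodes are plain values already given in document order, and the names context_positions and apply_qualifier are hypothetical helpers, not part of XPathLog or LoPiX.

```python
def context_positions(nodes, forward=True):
    """Annotate nodes (given in document order) with (k, n):
    n is the size of the list, j the 1-based index in document order,
    k = j for a forward axis and k = n + 1 - j for a backward axis."""
    n = len(nodes)
    annotated = []
    for j, y in enumerate(nodes, start=1):
        k = j if forward else n + 1 - j
        annotated.append((y, k, n))
    return annotated

def apply_qualifier(nodes, qualifier, forward=True):
    """Keep the nodes y for which qualifier(y, k, n) holds."""
    return [y for (y, k, n) in context_positions(nodes, forward)
            if qualifier(y, k, n)]

nodes = ["a", "b", "c"]
first_forward = apply_qualifier(nodes, lambda y, k, n: k == 1)
first_backward = apply_qualifier(nodes, lambda y, k, n: k == 1, forward=False)
last_forward = apply_qualifier(nodes, lambda y, k, n: k == n)
print(first_forward, first_backward, last_forward)
```

A qualifier such as [position() = 1] thus selects the first node in document order on a forward axis, but the last one in document order on a backward axis.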

Proof of Lemma 0] 

Note: A bit sloppily, we write $(x, \beta) \in \mathcal{SB}_{\mathcal{X}}(expr)$ for "$x \in Res(\mathcal{SB}_{\mathcal{X}}(expr))$
and $\beta \in Bdgs(\mathcal{SB}_{\mathcal{X}}(expr), x)$".



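1. For closed expressions, $x \in Res(\mathcal{SB}_{\mathcal{X}}(refExpr)) \Leftrightarrow$

$x \in Res(\mathcal{SB}_{\mathcal{X}}(refExpr, \emptyset)) \stackrel{IH}{\Leftrightarrow} x \in \mathcal{S}_{\mathcal{X}}(refExpr, \emptyset) \stackrel{Def}{\Leftrightarrow} x \in \mathcal{S}_{\mathcal{X}}(refExpr)$.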
2. Reference expressions are translated into path expressions wrt. a start node:

• entry points: rooted path 

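$(x, \beta) \in \mathcal{SB}_{\mathcal{X}}(/p, Bdgs) \stackrel{Def}{\Leftrightarrow} (x, \beta) \in \mathcal{SB}^{any}_{\mathcal{X}}(p, root, Bdgs)$

$\stackrel{IH}{\Leftrightarrow} x \in \mathcal{S}^{any}_{\mathcal{X}}(p, root, \beta)$ and $\beta$ completes some $\beta' \in Bdgs$ with free(/p)
$\stackrel{Def}{\Leftrightarrow} x \in \mathcal{S}^{any}_{\mathcal{X}}(/p, \beta)$ and $\beta$ completes some $\beta' \in Bdgs$ with free(/p).

• entry points: constants $c \in \mathcal{V}$ analogously (set $c$ instead of root above).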

• entry points: variables V G Var. By definition, 

{x,0)eSB x (y/ Pl Bdg8) W 

(:r,/3) G concatje^^escendants.rootjj.^^B^Cp^jSdys cxi {V/a;})) 

which is exactly the case if there is an x G ^^(descendants, root) |i such 
that (a, /3) G (SB a x v (p, x, Bdgs x {V/a})) . 
By induction hypothesis, this is equivalent with 

a; G S^ iy (p, x, (3) and /? completes some /?' G Bdgs X {V/a;} with free(p) 

which is exactly the case if x = (3{V) and (3 completes some j3' G Bdgs 
with free(V/p). By Def. this again is equivalent with 



x G S x nv (V/p, 0) and j3 completes some f3' G Bdgs with free(V/p) 



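3. Axis step: $(x, \beta) \in \mathcal{SB}^{any}_{\mathcal{X}}(axis :: pattern, z, Bdgs)$

$\stackrel{Def}{\Leftrightarrow} (x, \beta) \in \mathcal{SB}^{axis}_{\mathcal{X}}(pattern, z, Bdgs)$
$\stackrel{IH}{\Leftrightarrow} x \in \mathcal{S}^{axis}_{\mathcal{X}}(pattern, z, \beta)$

and $\beta$ completes some $\beta' \in Bdgs$ with free(pattern)

$\stackrel{Def}{\Leftrightarrow} x \in \mathcal{S}^{any}_{\mathcal{X}}(axis :: pattern, z, \beta)$

and $\beta$ completes some $\beta' \in Bdgs$ with free(axis :: pattern).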

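4. Node test: $(x, \beta) \in \mathcal{SB}^{a}_{\mathcal{X}}(name, z, Bdgs) \stackrel{Def}{\Leftrightarrow}$

$(x, \beta) \in \mathrm{list}_{(v,n) \in \mathcal{A}_{\mathcal{X}}(a,z),\ n = name}(v, \{true\} \bowtie Bdgs)$
which is exactly the case if $x \in \mathrm{list}_{(v,n) \in \mathcal{A}_{\mathcal{X}}(a,z),\ n = name}(v)$ and $\beta \in Bdgs$
which, by definition, is equivalent with $x \in \mathcal{S}^{a}_{\mathcal{X}}(name, z, \beta)$ and $\beta$ completes some
$\beta' \in Bdgs$ with free(name) = $\emptyset$. Analogously for node() and text().
Variables at nodetest position:

$(x, \beta) \in \mathcal{SB}^{a}_{\mathcal{X}}(N, z, Bdgs) \Leftrightarrow (x, \beta) \in \mathrm{list}_{(v,n) \in \mathcal{A}_{\mathcal{X}}(a,z)}(v, \{N/n\} \bowtie Bdgs)$

which is exactly the case if $x \in \mathrm{list}_{(v,n) \in \mathcal{A}_{\mathcal{X}}(a,z)}(v)$ and $\beta \in \{N/n\} \bowtie Bdgs$
which, by definition, is equivalent with $x \in \mathcal{S}^{a}_{\mathcal{X}}(N, z, \beta)$ and $\beta$ completes some
$\beta' \in Bdgs$ with free(N) = $\{N\}$.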

5. Step with variable binding:

$(x, \beta) \in \mathcal{SB}^{a}_{\mathcal{X}}(pattern \to V, z, Bdgs)$

$\stackrel{Def}{\Leftrightarrow} (x, \beta) \in \mathrm{list}_{(y,\xi) \in \mathcal{SB}^{a}_{\mathcal{X}}(pattern, z, Bdgs)}(y, \xi \bowtie \{V/y\})$

$\Leftrightarrow$ there is a $\beta''$ s.t. $(x, \beta'') \in \mathcal{SB}^{a}_{\mathcal{X}}(pattern, z, Bdgs)$ and $\beta = \beta'' \bowtie \{V/x\}$.
By induction hypothesis, this is exactly the case if there is a $\beta''$ such that
$x \in \mathcal{S}^{a}_{\mathcal{X}}(pattern, z, \beta'')$ and $\beta''$ completes some $\beta' \in Bdgs$ with free(pattern),
and $\beta = \beta'' \bowtie \{V/x\}$. Exactly then, since $x = \beta(V)$, by definition,
$x \in \mathcal{S}^{a}_{\mathcal{X}}(pattern \to V, z, \beta)$ and $\beta$ completes $\beta'$ with free(pattern $\to$ V) =
free(pattern) $\cup$ $\{V\}$.

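6. Step Qualifier(s): $(x, \beta) \in \mathcal{SB}^{a}_{\mathcal{X}}(pattern[stepQualifier], z, Bdgs)$

$\stackrel{Def}{\Leftrightarrow} (x, \beta) \in \mathrm{list}_{(y,\xi) \in \mathcal{SB}^{a}_{\mathcal{X}}(pattern, z, Bdgs),\ \mathcal{QB}_{\mathcal{X}}(stepQualifier, y, \xi') \neq \emptyset}(y,\ \mathcal{QB}_{\mathcal{X}}(stepQualifier, y, \xi') \setminus \{Pos, Size\})$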

for $\xi'$ as defined in Definition 10. This is exactly the case if (i) there is
a $\beta''$ s.t. $\beta'' \in \mathcal{QB}_{\mathcal{X}}(stepQualifier, x, \xi')$ and $\beta = \beta'' \setminus \{Pos, Size\}$, and
(ii) $(x, \xi) \in \mathcal{SB}_{\mathcal{X}}(pattern, z, Bdgs)$, i.e., $\xi$ is the corresponding set of variable
bindings, and (iii) $\mathcal{QB}_{\mathcal{X}}(stepQualifier, x, \xi') \neq \emptyset$.

The first item is by induction hypothesis equivalent to $\mathcal{Q}_{\mathcal{X}}(stepQualifier, x, \beta'')$
and $\beta''$ completes some $\beta' \in \xi'$ with free(stepQualifier) (*).
The third item is redundant here (it avoids the addition of elements with an
empty bindings list to the result). Since $\beta''$ completes some $\beta' \in \xi'$ with
free(stepQualifier), we know that $\gamma := \beta' \setminus \{Pos, Size\}$ is an element of $\xi$.
Specializing the second item to $\gamma$ yields $(x, \gamma) \in \mathcal{SB}_{\mathcal{X}}(pattern, z, Bdgs)$.
By induction hypothesis, $x \in \mathcal{S}_{\mathcal{X}}(pattern, z, \gamma)$ (**) and $\gamma$ completes some
$\gamma' \in Bdgs$ with free(pattern). Above, we derived $\gamma = \beta' \setminus \{Pos, Size\}$. Using
(*), since $\beta''$ is a completion of $\beta'$ with free(stepQualifier), completing
$\gamma' \in Bdgs$ first to $\gamma$ (binding free(pattern)), then to $\beta'$ (binding Size and Pos),
then to $\beta''$ (binding free(stepQualifier)), we have $\mathcal{Q}_{\mathcal{X}}(stepQualifier, y, \beta'')$.
From (**), since $\beta''$ completes $\gamma$, $x \in \mathcal{S}_{\mathcal{X}}(pattern, z, \beta'')$, thus by definition
the desired result $x \in \mathcal{S}_{\mathcal{X}}(pattern[stepQualifier], z, \beta'')$ for $\beta''$ which
completes $\gamma' \in Bdgs$ with free(pattern[stepQualifier]).

The argumentation showed the "$\Rightarrow$" direction (which is the more difficult
direction since $\gamma$ must be guessed). "$\Leftarrow$" uses the same relationships and
variable bindings.

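7. Path: $(x, \beta) \in \mathcal{SB}^{a}_{\mathcal{X}}(p_1/p_2, z, Bdgs)$

$\stackrel{Def}{\Leftrightarrow} (x, \beta) \in \mathrm{concat}_{(y,\xi) \in \mathcal{SB}^{a}_{\mathcal{X}}(p_1, z, Bdgs)} \mathcal{SB}^{any}_{\mathcal{X}}(p_2, y, \xi)$

$\Leftrightarrow$ there is a $(y, \xi) \in \mathcal{SB}^{a}_{\mathcal{X}}(p_1, z, Bdgs)$ s.t. $(x, \beta) \in \mathcal{SB}^{any}_{\mathcal{X}}(p_2, y, \xi)$

$\stackrel{IH}{\Leftrightarrow}$ there is a $\gamma \in \xi$ s.t. there is a $\gamma'$ s.t. $x \in \mathcal{S}^{any}_{\mathcal{X}}(p_2, y, \gamma')$ and
$\gamma'$ completes $\gamma$ with free($p_2$).

For this $\gamma$, $(y, \gamma) \in \mathcal{SB}^{a}_{\mathcal{X}}(p_1, z, Bdgs)$ and by induction hypothesis again
$y \in \mathcal{S}^{a}_{\mathcal{X}}(p_1, z, \gamma)$ and $\gamma$ completes some $\beta' \in Bdgs$ with free($p_1$). Thus, also
$x \in \mathcal{S}^{any}_{\mathcal{X}}(p_2, y, \gamma')$ and $y \in \mathcal{S}^{a}_{\mathcal{X}}(p_1, z, \gamma')$ and by definition $x \in \mathcal{S}^{a}_{\mathcal{X}}(p_1/p_2, z, \gamma')$. $\gamma'$
completes some $\beta' \in Bdgs$ with free($p_1$) $\cup$ free($p_2$).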
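8. Reference expressions (existential semantics) in step qualifiers:

$\beta \in \mathcal{QB}_{\mathcal{X}}(refExpr, z, Bdgs) \stackrel{Def}{\Leftrightarrow} \beta \in \bigcup_{(y,\xi) \in \mathcal{SB}^{any}_{\mathcal{X}}(refExpr, z, Bdgs)} \xi$

$\Leftrightarrow$ there is a $y$ s.t. $(y, \beta) \in \mathcal{SB}^{any}_{\mathcal{X}}(refExpr, z, Bdgs)$

$\stackrel{IH}{\Leftrightarrow} y \in \mathcal{S}^{any}_{\mathcal{X}}(refExpr, z, \beta)$

and $\beta$ completes some $\beta' \in Bdgs$ with free(refExpr)

$\stackrel{Def}{\Leftrightarrow} \mathcal{Q}_{\mathcal{X}}(refExpr, z, \beta)$ and $\beta$ completes some $\beta' \in Bdgs$ with free(refExpr).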

9. Built-in equality predicate "=": similar to predicates and variable assignments 
by — > V. All other predicates: /3 E Q$x (pred(argi, . . . , arg n ), z, Bdgs) 

W /3 E U £i ixi ... lx £„ 

C^i -£i ) x ( ar 9i> z >Bdgs). (x\,...,x n )£iT(pred) 

there are (ii, £i), . . . , (x n , £„) s.t. (xj,&) e SB a ^ ly (arg i: z, Bdgs) 
and (xi, . . . , x„) E I(pred) and ,9 E £i IX . . . txl £„ 

<^> (take the right f3i E £,) 

there are (xi, (3i), . . . , (x n ,/3 n ) s.t. (xj, /%) E SB a ^ v (argi, z, Bdgs) 
and (xi, . . . , x n ) E I(pred) and /3 = /3i IX . . . IX (3 n 

™ there are (xi,/3i), . . . , (x n ,/3 n ) s.t. x, E S^ ly (arg i , z,(3i) 
and extends some /3- E Bdgs with free(argi) 
and (xi, . . . , x n ) E T(pred) and /3 = /3i x . . . x (3 n 
(the join guarantees that f3' := (3[ = . . . = f3' n holds) 
there are x\,...,x n s.t. Xi E S°^ v (argi, z, (3i) 
and (3 extends some j3' E Bdgs with free(argi) U . . . U free(arg n ) 

Qx(pred(argi : . . . , arg n ),z, (3) 

and /? completes some /?' E Bdgs with free(pred(argi, . . . ,arg n )) . 

10. Negated expressions which do not contain any free variable: trivial.

For negated expressions which contain free variables: Note that all variables
in free(not expr) are required to be bound by Bdgs (safety).

13 E QBx (not expr, z, Bdgs) 

(3 E Bdgs and there is no f3' E QBx(expr, z, Bdgs) s.t. (3 < (3 1 

™ /? E iWgs and there is no /3" such that 

Qx(expr, z, (3") and /3" extends with free(expr) and (3 < [3' 
Safety p ^ Bdgs and not Qx(expr, z, [3) 
D ||P /? e Bdgs and Qx( not expr, z, /?) . 
Conjunction: [3 E QS^(expri and expr^, z, Bdgs) 

?§- f /3 E QBx(expr\, z, Bdgs) x Q£>;f (expr2, z, QS^(expri, z, Bdgs)) 

™ there are 71 E QBx (expr 1, z, Bdgs) 

and 72 E Q,6;f (expri, z, Qi3^(expri, z, Bdgs)) s.t. Q^(expn, z, 71) 
and 71 completes some /3' E Bdgs with free(expri) and 
QA-(ea;pr2, z, 72) and 71 completes some 7" E QBx(expri, z, Bdgs) 
with free(expr2) and /? = 71 x 72. 
(join condition: 71 = 7" < 72) Qx(expr 1,2,72) and Q^expr^, z, 72)
and 72 completes some j3' G Bdgs with free(expri) U free(expr2) 
Qx(expr 1 and expr 2 , z, 72) 

and 72 completes some [3' G Bdgs with free(ea;pr 1 ) U free(expr 2 ) . 

11.-13.: trivial (safety for variables; functions similar to predicates).
14. Context-related functions use the extension of variable bindings by
pseudo-variables Size and Pos in rule (6):

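$(x, \beta) \in \mathcal{SB}^{any}_{\mathcal{X}}(position(), z, Bdgs)$
$\stackrel{Def}{\Leftrightarrow} (x, \beta) \in \mathrm{list}_{\beta \in Bdgs}(\beta(Pos), \{\beta' \in Bdgs \mid \beta(Pos) = \beta'(Pos)\})$
$\Leftrightarrow \beta(Pos) = x$ for some $\beta \in Bdgs$
$\Leftrightarrow x \in \mathcal{S}^{any}_{\mathcal{X}}(position(), z, \beta)$ for some $\beta \in Bdgs$
$\Leftrightarrow x \in \mathcal{S}^{any}_{\mathcal{X}}(position(), z, \beta)$

and $\beta$ completes some $\beta' \in Bdgs$ by free(position()) (which is empty).

Analogously for last().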
Proof of Lemma [3 Structural induction. 

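• entry case (using $\beta = \beta'$): $(\mathcal{X}, \beta) \models /p \stackrel{Def}{\Leftrightarrow} \mathcal{S}_{\mathcal{X}}(/p, \beta) \neq \emptyset$

$\stackrel{Def}{\Leftrightarrow} \mathcal{S}_{\mathcal{X}}(p, root, \beta) \neq \emptyset \stackrel{Def}{\Leftrightarrow} \mathcal{S}_{\mathcal{X}}(root/p, \beta) \neq \emptyset \stackrel{Def}{\Leftrightarrow} (\mathcal{X}, \beta) \models root/p$
$\Leftrightarrow (\mathcal{X}, \beta) \models atomize(root/p) \Leftrightarrow (\mathcal{X}, \beta) \models atomize(/p)$.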

• Paths are resolved into steps and step qualifiers are isolated (the case where 
a don't care variable is introduced is shown; w.l.o.g., path is an absolute path 
expression) 

(X , [3) \= path/ 'axis :: nodetest[stepQualifier] /remainder 
4=> Sx(j>athj axis :: nodetest[step Qualifier] /remainder, (3) ^ 

Sxipath/ axis :: nodetest[step Qualifier] /remainder, root, (3) 7^ 

COnCa W£(patVa^:nodet eS t[^^ ^ 

there is a node u G S x (pathj 'axis :: nodetest[stepQualifier], root, (3) 
s.t. S x ny (remainder, v, (3) ^ 

there is a node u G Wst y e s%{path/axis::nodetest,x,p){v I Qx{stepQualifier,y, 13)) 

s.t. S x ny (remainder, v, (3) ^ 
<^ there is a node u s.t. v G S x (path/ axis :: nodetest, x, (3) 

and Qx (step Qualifier, v, (3) and S x ny (remainder, v, (3) ^ 
there is a node u s.t. u G S x (path/ axis :: nodetest — > _X, x, ) 

and Q x(V [step Qualifier], v, f3 v x ) and S x ny (V/ remainder, v,(3 v x ) ^ 
there is a node u s.t. t> G 5^.(pai/i[axis :: nodetest — > x, /^x) 

and Q x(V [step Qualifier], v, f3 v _ x ) and S x ny (V /remainder, v, f3^_ x ) ^ 
there is a node u s.t. iS^.(pai/i[axis :: nodetest — » _X ] , x, yS'V ) ^ 

and Q x(V [step Qualifier], v, [3^) and S a x y {V /remainder, x,f3 v _ x ) ^ 
<^> there is a node u s.t. Qx(path[axis :: nodetest — > Jf],/?^-) 

and Qx(V [step Qualifier], [3^) and Qx(V/ remainder, 
™ there is a node v s.t. Qx{atom\ze(path[axis :: nodetest — » 

and Qa- (atomize( V [step Qualifier]), V _ X ) and (atomize(Vy remainder), 
there is a node u s.t. Q#(atomize(. . ■),0 v _ x ) ■ 
Conjunctions in step qualifiers: obvious. 

Predicates in step qualifiers: W.l.o.g., consider a unary predicate with a rela- 
tive argument expression: 

{X,0) \= V[pred(expr)} 

S x (V\pred(expr)],0(V),0)^9 

\ lst y£S%,(v,f3(v)jJ)(y I Qx(pred(expr),y,f3)) ^0 
^> {0{V) is the only clement in S X {V, 0{V), 0)) s.t. Q x (pred(expr), 0(V), 0) 

there is an x G S x (expr, 0(V), 0) such that pred(x) e X 
^=> there is an a; s.t. x G S x (V/expr — > _X,root, X _ X ) and (X,0* x ) \= pred(-X) 
<^> there is an a; s.t. (X, X _ X ) |= V/expr — > _X and (A", |= pred(_X) 
™ there is an x s.t. {X,0 X _ X ) |= atomizefV/expr — > _X) 

and (X,0* x ) ^pred(~X) 
<^> there is an x s.t. (X 7 0* x ) \= atom\ze(V [pred(expr)]) . 

Predicate atoms: analogous. 



